Using eye movements, electrodermal activities, and heart rates to predict different types of cognitive load during reading with background music

The experiment was approved by The University of Hong Kong’s Human Research Ethics Committee (reference number: EA1802092). All methods were carried out in compliance with the American Psychological Association ethical standards. All participants gave their informed consent prior to their involvement in the user experiment.

Participants recruitment requirement

Participants were required to be English as a Second Language (ESL) learners aged between 18 and 35, without any visual, hearing, or learning impairments. An a priori power analysis was conducted using G*Power³⁵, informed by empirical effect sizes reported in DeLeeuw and Mayer⁴, as well as guidelines referenced in Zu et al.¹. The analysis considered anticipated effects across different types of cognitive load (extraneous, intrinsic, and germane). Based on these estimations, a sample size of up to 94 participants was considered sufficient to detect the effect sizes reported in prior literature⁴ with 80% power at α = .05. Our final sample included 102 participants, which exceeded this threshold. Detailed sample size calculations are provided in the Appendix in the supplementary material. More generally, eye movement studies that analyze the effects of manipulated cognitive factors on eye-tracking measures consider a sample size exceeding 50 participants to be large¹.

Procedure

As Fig. 1 shows, participants first filled in a pre-survey that gathered their demographic information (e.g., gender, age, major). They then took the LexTale test, which measures general English proficiency. The Empatica E4 wristband was then allowed to stabilize to ensure the quality of the peripheral physiological signals. After that, participants were asked to sit at complete rest for 2 min so that baselines of the peripheral physiological signals could be recorded. The facilitator then guided participants through a practice block to familiarize them with the formal experiment.

The main task required participants to read all the passages either with BGM (for those in the experimental group) or in silence (for those in the control group), during which the movements of their dominant eye were tracked and their peripheral physiological signals were recorded. The nine passages were evenly assigned to three blocks. The order of the blocks and passages was counterbalanced across participants with regard to difficulty levels (see below). To ensure the accuracy of eye tracking, eye-tracker calibration was performed before each block, and drift correction was conducted before each passage. After finishing a passage, participants were instructed to press the “continue” button and then answer, in sequence, two cognitive load questions regarding the perceived difficulty level and extent of understanding of the passage, followed by two reading comprehension questions on the content of the passage. Participants were given unlimited time for reading and answering questions, in order to simulate a typical reading context rather than an examination. After each block, participants had a 2-min rest to relieve potential tiredness. After the three blocks, they completed the two-back test, which measures working memory capacity (WMC).

Reading task

The reading task was composed of nine English passages selected from the reading comprehension samples of the Graduate Record Examinations (GRE), covering a variety of general topics. The passages were divided into three levels of text difficulty determined by a widely used text readability score, the Flesch-Kincaid grade³⁶. The higher the grade, the harder the text is to read. We calculated this metric using the online software readable.io. Means and standard deviations (in parentheses) of the grade of passages in each level were: 13.3 (0.35) for easy, 17.4 (0.46) for medium, and 21.0 (0.47) for hard. The easy-level passages covered themes of astronomy, astrobiology, and physics; the medium-level passages covered themes of archaeology, history, and literature; the hard-level passages covered themes of biology, sociology, and anthropology.
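
For readers unfamiliar with the metric, the standard Flesch-Kincaid grade formula can be sketched as follows. This is only an illustration: the word, sentence, and syllable counts in the example are hypothetical and not taken from the study passages, and readable.io may count syllables and sentences differently, so its output need not match this function exactly.

```python
def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch-Kincaid grade level computed from raw text counts."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts for a passage of about 218 words (illustrative only).
example_grade = flesch_kincaid_grade(n_words=218, n_sentences=8, n_syllables=380)
```

Longer sentences and more syllables per word both raise the grade, which is why dense GRE-style passages score well above grade 12.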

The passages covered a diverse range of themes, including astronomy, astrobiology, physics, archaeology, history, literature, biology, sociology, and anthropology, with the aim of mitigating potential biases arising from learners’ varied knowledge backgrounds. All passages expressed complete ideas or meanings and contained a similar number of words, with an average word count of 218 and a standard deviation of 9.8. For each passage, we designed one text-based question, which was conceptually simple and only required shallow understanding of the content, and one inference-based question, which required reasoning and deep understanding of the content^10,37. Each question had four response alternatives. Accuracy in answering the text-based and inference-based questions was assessed.

Self-reported cognitive load questions

The questions were used to evaluate students’ self-perceived difficulty level and extent of understanding of the passage. For self-perceived difficulty of the passage, we asked “How difficult do you think the passage is?”, which was adapted from the NASA Task Load Index (TLX) Mental Demand question²². For self-perceived understanding of the passage, we asked “To what extent do you understand the passage?”, which was adapted from the TLX Performance question²². Each question was rated and recorded on a 5-point Likert scale, with 1 being the lowest rating (i.e., very easy, not understanding at all) and 5 being the highest rating (i.e., very difficult, understanding very well).

LexTale test

English proficiency was assessed with the LexTale test, which measures participants’ familiarity with English words and represents their general English proficiency³⁸. The test contained 60 trials, and in each trial a string of letters was presented. Participants judged whether or not the string was an existing English word. English proficiency was computed as the percentage of correct responses in the LexTale test.

Two-back test

WMC was examined with two-back tests³⁹. In the tests, a continuous stream of single letters was shown at different locations, one at a time. Participants determined whether the current letter (in the verbal subtask) or the current location (in the spatial subtask) was the same as the one presented two trials earlier. Each subtask consisted of 36 trials, with each letter displayed for 1000 milliseconds (ms) followed by a 2500 ms blank screen. WMC was measured as the average accuracy of all trials across the verbal and spatial subtasks.
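
The scoring logic of a two-back task can be sketched as below. The function names are hypothetical, and treating the first two trials of a stream as non-targets is an assumption of this sketch; the source does not state how those trials were scored.

```python
from typing import List, Sequence


def two_back_targets(stream: Sequence[str]) -> List[bool]:
    """True where an item matches the item shown two trials earlier.

    Assumption: the first two trials cannot be targets and are scored as non-targets.
    """
    return [i >= 2 and stream[i] == stream[i - 2] for i in range(len(stream))]


def pooled_accuracy(responses: Sequence[bool], targets: Sequence[bool]) -> float:
    """Proportion of trials on which the same/different response was correct."""
    if len(responses) != len(targets) or len(targets) == 0:
        raise ValueError("responses and targets must be non-empty and equal in length")
    return sum(r == t for r, t in zip(responses, targets)) / len(targets)


# Illustrative verbal stream (letters) and spatial stream (location labels).
verbal_targets = two_back_targets(["A", "B", "A", "C", "A", "C"])
spatial_targets = two_back_targets(["NW", "SE", "NW", "NW", "SE", "NW"])

# Hypothetical participant responses (True = "same as two trials earlier").
verbal_resp = [False, False, True, False, True, False]
spatial_resp = [False, False, True, False, False, True]

# WMC as accuracy pooled over all trials of both subtasks.
wmc = pooled_accuracy(verbal_resp + spatial_resp, verbal_targets + spatial_targets)
```

Because both subtasks contain the same number of trials in the study (36 each), pooling all trials is equivalent to averaging the two subtask accuracies.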

Apparatus

An EyeLink 1000 Plus (tower mount model) was used to record participants’ eye movements while they read the passages. The eye tracker’s sampling rate was set to 2000 Hz, and the reading passages were displayed on a 19-inch computer monitor with a resolution of 1280 × 1024 pixels. The viewing distance was 56 cm, and the horizontal visual angle of each English character was around 0.3°, which simulated a normal reading situation⁴⁰. A research-grade wristband, the Empatica E4, was worn by the participants to record peripheral physiological signals including Heart Rate (HR), Inter-beat Intervals (IBI), and Electrodermal Activity (EDA), among others. All recorded data were synchronized by timestamps and anonymized to preserve confidentiality.
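
The relation between viewing distance, on-screen size, and visual angle can be checked with a short calculation. The sketch below uses the usual geometric formula, size = 2 · d · tan(θ/2); the function names are illustrative, and the resulting character width is derived here rather than reported in the source.

```python
import math


def size_from_visual_angle(angle_deg: float, distance_cm: float) -> float:
    """On-screen size (cm) that subtends `angle_deg` at `distance_cm`."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by an object of `size_cm` at `distance_cm`."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


# Character width implied by 0.3 degrees at a 56 cm viewing distance.
char_width_cm = size_from_visual_angle(0.3, 56.0)
```

Under these values each character is roughly 0.29 cm wide on the screen.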

Design for examining three types of cognitive load

We examined three types of cognitive load following the experimental design adopted in previous studies^1,4.

Extraneous load was created based on the coherence principle, which proposes that irrelevant information should be avoided in multimedia presentations¹⁶. According to the coherence principle, irrelevant sound such as BGM should not be presented along with the text because it requires cognitive resources to either attend to or inhibit the irrelevant information¹⁶. In the present study, participants were randomly assigned to either the BGM or the silence condition. In the BGM condition, passages were presented with the accompaniment of participants’ self-provided preferred BGM. Since participants needed to either process or inhibit the irrelevant sound while reading, listening to BGM was expected to increase their extraneous cognitive load, as reflected in the impaired performance on reading comprehension tests in previous studies^13,14. The background audio condition (i.e., BGM vs. silence) was therefore a between-subject factor.

Intrinsic load depends on the complexity of the learning material and was created with text complexity determined by Flesch-Kincaid Grade³⁶ (as described in Reading Task). Based on the grades, passages were evenly coded into Easy, Medium, and Hard levels. The level of passage complexity was therefore a within-subject factor in the experiment.

Germane load is defined as the cognitive resources that are available to process the intrinsic load^1,4. It is linked to effective learning and is indicative of deep cognitive processing, as demonstrated by activities such as mentally organizing learning materials. This type of cognitive load is instrumental in enhancing learners’ performance¹⁸. In the present study, germane load was reflected by reading comprehension test scores. This assumes that those who comprehend materials better have invested more cognitive resources to process the material more elaborately (i.e., have experienced higher germane load), so that those who experienced high or low germane load during the reading phase would score higher or lower, respectively, in the reading comprehension tests. Similar to DeLeeuw and Mayer’s study on multimedia lessons⁴, we divided participants into low and high germane load groups using a mean split on the passage’s reading comprehension score. That is, participants who had a better-than-average score were designated as having experienced high germane load, and those who had a lower-than-average score were designated as having experienced low germane load. The reading comprehension test score was thus a between-subject factor.
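
A mean split of this kind can be sketched as follows. The participant identifiers and scores are hypothetical, and assigning a score exactly equal to the mean to the low group is an assumption of this sketch, since the text only specifies better-than-average and lower-than-average scores.

```python
from statistics import mean
from typing import Dict


def mean_split(scores: Dict[str, float]) -> Dict[str, str]:
    """Label each participant 'high' or 'low' germane load by a mean split.

    Assumption: a score exactly equal to the mean is labelled 'low'.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    average = mean(scores.values())
    return {pid: ("high" if score > average else "low") for pid, score in scores.items()}


# Hypothetical comprehension accuracy (proportion correct) for one passage.
groups = mean_split({"p01": 0.5, "p02": 1.0, "p03": 0.0, "p04": 1.0})
```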

Proposed measures

This study collected questionnaire, EM, EDA, and HR/HRV data during the reading task.

Questionnaire

Self-reported measures were evaluated with two subjective questions that asked about participants’ perceived difficulty level and degree of understanding of the passage (see the exact questions in Self-reported Cognitive Load Questions). The ratings on the two scales were analyzed separately, and each was computed as the average rating across passages.

Eye movement metrics

Table 1 shows the eye movement (EM) features we used to infer readers’ cognitive processes, including mean fixation duration, mean saccade amplitude, fixation count, and regression count²⁶. Longer and more frequent fixations can indicate that a task is more cognitively demanding and that the individual is experiencing cognitive overload⁴¹. Saccade amplitude indicates the length of the rapid movement between two fixations. Individuals with reading problems often exhibit shorter saccade amplitudes²⁶, and a reading task with high cognitive demand results in shorter saccade amplitudes²⁷. Regressive EM behaviours involve moving backwards to a previously fixated word. A cognitively demanding reading task typically causes more regressions^27,42. Mean Fixation Duration and Mean Saccade Amplitude are word-level measures, aggregated as mean values across all words within a passage. Fixation Count and Regression Count are passage-level measures, aggregated as sums per passage.

Table 1 Eye movement measures and their descriptions.

Electrodermal activity, heart rate, and heart rate variability metrics

Table 2 shows the peripheral physiological measures and their descriptions. Electrodermal Activity (EDA) refers to the signals produced by sweat on the skin. Skin Conductance Responses (SCRs) refer to the phasic activity of EDA signals, which fluctuates rapidly. SCRs are innervated by sudomotor nerves of the sympathetic nervous system, firing in response to emotional and stressful stimuli²⁹. Therefore, fluctuations in SCRs are thought to be salient indicators of changes in the extent of stress (e.g., cognitive load) induced by a stimulus or event⁴³. Similar to prior studies^31,44, we calculated SCR Frequency (i.e., the number of detected SCR peaks per second) and SCR Amplitude (i.e., the mean amplitude of the detected SCR peaks) as potential indicators of cognitive load.
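
Given a list of already detected SCR peak amplitudes for one reading segment, the two features can be sketched as below. The function name is hypothetical, the example amplitudes are made up, and returning NaN for the amplitude when no peak is detected is an assumption of this sketch rather than something stated in the source.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def scr_features(peak_amplitudes: Sequence[float], duration_s: float) -> Dict[str, float]:
    """SCR Frequency (peaks per second) and SCR Amplitude (mean peak amplitude)."""
    if duration_s <= 0:
        raise ValueError("segment duration must be positive")
    n_peaks = len(peak_amplitudes)
    return {
        "scr_frequency": n_peaks / duration_s,
        # Assumption: amplitude is undefined (NaN) when no SCR peak was detected.
        "scr_amplitude": mean(peak_amplitudes) if n_peaks else math.nan,
    }


# Hypothetical segment: three peaks (in microsiemens) detected over 60 s of reading.
features = scr_features([0.12, 0.30, 0.18], duration_s=60.0)
```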

Table 2 Peripheral physiological measures and their descriptions.

Heart Rate (HR) and Heart Rate Variability (HRV) are two main cardiac response constructs commonly studied in cognitive load research. As cognitive demand increases, HR rises while HRV decreases³³. This is because, in a state of psychophysiological arousal, the heart rhythm becomes faster and more uniform. In this study, we chose the mean HR (beats per minute) and one widely used and psychologically meaningful HRV feature, RMSSD (i.e., the root mean square of successive differences between normal heartbeats). RMSSD is obtained by computing the differences between adjacent inter-beat intervals in milliseconds, squaring these differences, averaging the squared values, and taking the square root of that average; it reflects larger changes from one beat to the next^45,46.
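
The RMSSD definition can be written out in a few lines. This is a plain re-statement of the formula for illustration, not the hrvanalysis implementation used in the study, and the inter-beat intervals in the example are hypothetical.

```python
import math
from typing import Sequence


def rmssd(ibi_ms: Sequence[float]) -> float:
    """Root mean square of successive differences between inter-beat intervals (ms)."""
    if len(ibi_ms) < 2:
        raise ValueError("at least two inter-beat intervals are required")
    # Differences between each interval and the one before it.
    diffs = [later - earlier for earlier, later in zip(ibi_ms, ibi_ms[1:])]
    mean_squared = sum(d * d for d in diffs) / len(diffs)
    return math.sqrt(mean_squared)


# Hypothetical inter-beat intervals in milliseconds.
example_rmssd = rmssd([800.0, 810.0, 790.0, 805.0])
```

In the example the successive differences are 10, -20, and 15 ms, so the result is the square root of (100 + 400 + 225) / 3.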

Preprocessing eye movements and peripheral physiological signals

The automatic parser of the EyeLink 1000 Plus, with default settings for cognitive research, was used to identify saccades and fixations: if an eye movement had an instantaneous velocity greater than 30°/s or an acceleration greater than 8000°/s², it was categorized as a saccade, and the remaining data points between successive saccades were classified as fixations. The EyeLink Data Viewer was then used to generate a series of eye movement data. Outlying eye movements, such as those falling outside the image stimuli and those caused by drift correction, were removed.
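
The threshold rule described above can be illustrated with a simplified sample-by-sample classifier. This is only a sketch of the idea: the EyeLink parser is proprietary and uses its own velocity and acceleration estimation, so this code is not a re-implementation of it. The finite-difference estimates, the function name, and the toy gaze trace are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def classify_samples(
    x_deg: Sequence[float],
    y_deg: Sequence[float],
    sampling_rate_hz: float,
    velocity_threshold: float = 30.0,        # degrees per second
    acceleration_threshold: float = 8000.0,  # degrees per second squared
) -> List[str]:
    """Label each gaze sample 'saccade' or 'fixation' using simple thresholds."""
    if len(x_deg) != len(y_deg):
        raise ValueError("x and y must have the same length")
    n = len(x_deg)
    if n == 0:
        return []
    dt = 1.0 / sampling_rate_hz
    # Speed from the displacement between consecutive samples (first sample: 0).
    speed = [0.0] + [
        math.hypot(x_deg[i] - x_deg[i - 1], y_deg[i] - y_deg[i - 1]) / dt
        for i in range(1, n)
    ]
    # Absolute change in speed between consecutive samples (first sample: 0).
    accel = [0.0] + [abs(speed[i] - speed[i - 1]) / dt for i in range(1, n)]
    return [
        "saccade"
        if speed[i] > velocity_threshold or accel[i] > acceleration_threshold
        else "fixation"
        for i in range(n)
    ]


# Toy trace at 1000 Hz: the eye rests, jumps 1 degree to the right, then rests again.
labels = classify_samples(
    x_deg=[0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
    y_deg=[0.0] * 7,
    sampling_rate_hz=1000.0,
)
```

In this toy trace the sample immediately after the jump is still labelled a saccade because the deceleration exceeds the acceleration threshold, even though the speed has already dropped to zero.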

The difference in sampling rates (EDA: 4 Hz; HR: 1 Hz) was taken into account when pre-processing the different kinds of peripheral physiological signals. All signals were segmented based on the start and end timestamps of the period spent reading each passage, or of the 2-min rest period at the beginning of the experiment that provided the baseline of the peripheral physiological signals. Features were then derived from the segments according to the corresponding feature extraction methods, as follows.
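
Timestamp-based segmentation of signals with different sampling rates can be sketched as below. The sketch assumes that each recording is described by a start timestamp and a fixed sampling rate, from which per-sample timestamps are reconstructed; the function name, the half-open interval convention, and all numeric values are illustrative assumptions.

```python
from typing import List, Sequence


def segment_signal(
    samples: Sequence[float],
    recording_start_s: float,
    sampling_rate_hz: float,
    segment_start_s: float,
    segment_end_s: float,
) -> List[float]:
    """Return samples whose reconstructed timestamps fall in [segment_start_s, segment_end_s)."""
    segment = []
    for index, value in enumerate(samples):
        timestamp = recording_start_s + index / sampling_rate_hz
        if segment_start_s <= timestamp < segment_end_s:
            segment.append(value)
    return segment


# Toy recordings that both start at t = 100 s and last 10 s.
eda = [float(i) for i in range(40)]  # 4 Hz
hr = [float(i) for i in range(10)]   # 1 Hz

# The same 2-second reading window yields 8 EDA samples but only 2 HR samples.
eda_segment = segment_signal(eda, 100.0, 4.0, 102.0, 104.0)
hr_segment = segment_signal(hr, 100.0, 1.0, 102.0, 104.0)
```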

Raw EDA signals were analyzed with neurokit2, a Python-based toolkit that can detect peaks in EDA signals, namely Skin Conductance Response (SCR) events⁴⁷. In pre-processing, we first filtered noise from the EDA signals with a unidirectional first-order Butterworth low-pass filter with a cut-off frequency of 0.05 Hz⁴⁸ and then decomposed the signals into phasic and tonic components using the cvxEDA method²⁸, as provided by the eda_phasic function of neurokit2 (with a sampling rate of 4 Hz). After that, we detected SCR peaks in the phasic component with the eda_peaks function before computing the frequency and amplitude of SCRs.

We used hrvanalysis, a Python package for heart rate variability analysis⁴⁹, to compute the HRV RMSSD feature from the IBI data provided by the Empatica E4 (see how it is computed in the Electrodermal Activity, Heart Rate, and Heart Rate Variability Metrics section). In pre-processing the signals, we first used the hrvanalysis package to flag RR intervals above 2000 ms or below 300 ms as outliers, and to identify ectopic beats via the Malik method: intervals deviating by more than 20% from the preceding interval were regarded as outliers and replaced by linear interpolation⁴⁵. Moreover, we computed the mean HR based on the HR data directly provided by the Empatica E4.
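
The cleaning steps described above can be illustrated with a dependency-free sketch. It is not the hrvanalysis code itself: the function names are hypothetical, the order of operations (range filter, interpolation, 20% rule, interpolation) is an assumption, and edge gaps are filled with the nearest valid value, which the source does not specify.

```python
from typing import List, Optional, Sequence


def interpolate_linear(values: Sequence[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between the nearest valid neighbours.

    Assumption: gaps at either end are filled with the nearest valid value.
    """
    valid = [i for i, v in enumerate(values) if v is not None]
    if not valid:
        raise ValueError("no valid intervals to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def mark_ectopic(rr_ms: Sequence[float], rule: float = 0.2) -> List[Optional[float]]:
    """Set to None any interval deviating by more than `rule` from the preceding one."""
    if not rr_ms:
        return []
    marked: List[Optional[float]] = [rr_ms[0]]
    for previous, current in zip(rr_ms, rr_ms[1:]):
        marked.append(None if abs(current - previous) > rule * previous else current)
    return marked


def clean_rr(rr_ms: Sequence[float], low: float = 300.0, high: float = 2000.0) -> List[float]:
    """Range filter, interpolate, apply the 20% rule, interpolate again."""
    in_range = [rr if low <= rr <= high else None for rr in rr_ms]
    marked = mark_ectopic(interpolate_linear(in_range))
    return interpolate_linear(marked)


# Hypothetical RR series (ms) with one out-of-range interval.
cleaned = clean_rr([800.0, 2500.0, 820.0])
```

Note that, with a literal comparison against the preceding interval, the interval immediately after an ectopic beat is usually flagged as well, because it is compared with the ectopic value.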

Data analysis

To check for potential individual differences between the two background audio conditions (i.e., BGM vs. silence), we applied independent-samples t-tests to compare two learner characteristics, namely English proficiency (measured by LexTale test scores) and WMC (measured by two-back test accuracy), as well as the baseline measurements of the peripheral physiological signals from the 2-min resting period, between the two conditions.

Before answering the RQ, we performed Mann-Whitney U tests (for ordinal variables) to check whether each self-reported measure differed between conditions (i.e., BGM vs. silence, or low vs. high comprehension accuracy; between-subject design). Moreover, one-way repeated-measures ANOVA was conducted to examine whether each self-reported measure differed across intrinsic load conditions (i.e., easy vs. medium vs. hard text complexity; within-subject design). Mauchly’s test was used to assess sphericity. When sphericity was violated, degrees of freedom were corrected using the Huynh-Feldt correction (estimated epsilon ε > 0.75) or the Greenhouse-Geisser correction (ε < 0.75)⁵⁰. Significant main effects were followed up with Bonferroni-corrected post hoc tests for the three pairwise comparisons (corrected significance threshold: p = 0.016).

To answer the RQ, we constructed binary logistic regression models (method: stepwise) to predict cognitive load types (i.e., conditions of extraneous, intrinsic, and germane load served as dependent variables) by simultaneously including multimodal sensor data and learner characteristics as independent variables. Measures of multimodal sensor data were extracted from the EM, EDA, and HR/HRV signals that were collected while participants were reading the passages. Learner characteristics included WMC and general English proficiency. Only statistically significant models with significant predictors were reported. We performed a series of likelihood ratio tests to evaluate the overall goodness of fit of the logistic regression models, reported as chi-square statistics and the associated p-values, complemented by R² measures⁵¹. In particular, Nagelkerke’s R² was used to represent how well the independent variables explain the variance in the dependent variable in a logistic regression model.
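
The two fit statistics mentioned above can be computed from the log-likelihoods of the intercept-only (null) model and the fitted model. The sketch below uses the standard definitions of the likelihood ratio chi-square and of Nagelkerke’s R² (the Cox and Snell R² rescaled by its maximum); the log-likelihood values and sample size in the example are hypothetical and are not results from this study.

```python
import math


def likelihood_ratio_chi2(ll_null: float, ll_model: float) -> float:
    """Likelihood ratio statistic: twice the gain in log-likelihood over the null model."""
    return 2.0 * (ll_model - ll_null)


def nagelkerke_r2(ll_null: float, ll_model: float, n_obs: int) -> float:
    """Nagelkerke's R-squared: Cox and Snell R-squared divided by its maximum value."""
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n_obs)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n_obs)
    return cox_snell / max_cox_snell


# Hypothetical example: 100 observations with a balanced binary outcome.
n = 100
ll0 = n * math.log(0.5)  # log-likelihood of the intercept-only model
ll1 = -50.0              # hypothetical log-likelihood of the fitted model

chi2 = likelihood_ratio_chi2(ll0, ll1)
r2 = nagelkerke_r2(ll0, ll1, n)
```

The chi-square statistic is then compared against a chi-square distribution whose degrees of freedom equal the number of predictors added to the null model.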
