Deep learning to quantify care manipulation activities in neonatal intensive care units
Study design and participants
To develop and evaluate our method, a multi-modal dataset of neonates undergoing routine care procedures in NICUs was collected. The dataset, composed of video recordings and their associated physiological data, subsumed the data used in Singh et al.19 and extended it substantially with an additional 218 h of recordings.
The NEO device with a camera module32 was used to collect the video and physiological data. A NEO device was installed on each bed and recorded videos from the camera module, as well as heart rates and SpO2 levels from the existing monitoring devices in the NICU. The camera module consisted of a Logitech C920 mounted on a flexible tube, allowing the camera position to be adjusted. The study was designed to interfere minimally with the clinical workflow. The only instruction provided to the nursing staff and other attendants was to ensure that the camera was oriented towards the neonate during routine care procedures. The resulting videos captured activities under various lighting conditions (e.g., different times of day) and viewing angles (e.g., different positioning of the camera with respect to the neonate). The data recording setup and sample video frames are shown in Fig. 7.
All data were collected from a level III, 22-bed urban NICU and a level II-b, 17-bed rural NICU in India. The typical staff in the NICUs consisted of 3 neonatologists with doctoral degrees in neonatal sciences, 3-4 medical residents, and 18-20 nurses. A total of 27 neonates were included in the study, with an average age of 33.0 weeks. The Institutional Review Boards of the Apollo Cradle & Children Hospital (Moti Nagar, Delhi, India) and the Kalawati Hospital (Rewari, India) approved this study and waived the requirement for informed consent. The electronic health records of the neonates were de-identified in accordance with Health Insurance Portability and Accountability Act regulations, and the research was conducted in compliance with relevant guidelines. For figures in this paper and its additional information that include images of neonates with blurred faces, written consent was obtained from the parents of the eligible neonates.
Data statistics and annotation
Video recordings were captured at a resolution of 1280 × 720 pixels and a rate of 30 frames per second. Physiological data were recorded at rates ranging from 1 reading per minute to 1 reading per second, depending on the monitoring device in the NICU. A manual inspection was performed to select videos in which the neonate is clearly visible and at least one manipulation activity is observed, leading to a dataset of 330 videos (average length 52.5 min) with their associated physiological data, lasting 288.8 h in total. Two trained annotators were then tasked to independently mark the start and end times of every instance of diaper change, feeding, and patting in the videos. A further verification step was performed to resolve discrepancies between the annotations. A total of 650 care manipulation activities were identified in the video dataset. Due to issues with synchronization and data corruption, 479 of these 650 activities (corresponding to 19 of the 27 neonates) had physiological data synchronized with the video. Details of the dataset are summarized in Table 1.
Deep learning for video analysis
Central to our approach is a deep learning method for temporal activity localization, i.e., recognizing the occurrence of care manipulation activities in a video and localizing their corresponding onsets and offsets. Our approach combines methods for video representation learning and for temporal activity localization. A flow chart of our method is illustrated in Fig. 8.
To represent an input video, a convolutional neural network (SlowFast33) pre-trained on a large-scale Internet video dataset (Kinetics34) was used to extract video features. SlowFast is well suited for analyzing NICU videos; it has been widely adopted for video understanding and has demonstrated reliable efficacy across a diverse range of tasks, including the recognition of human activities33, the monitoring of animal behaviors35, and the analysis of driving scenarios36. To bridge the gap between NICU videos and the Internet videos used for pre-training, transfer learning was employed to adapt the pre-trained model to NICU videos. Specifically, the SlowFast model was fine-tuned on labeled video clips of care manipulation activities sampled from the training-set videos during cross-validation. Each clip spanned 2.67 s, containing 32 frames sampled at 12 Hz, with each frame randomly cropped to 224 × 224 pixels from a resolution of 512 × 288 (width × height). This fine-tuning step improved the performance of activity detection, as evaluated in Supplementary Table 4. The fine-tuned model was then used to extract clip-level video features for temporal activity localization: features were extracted from overlapping clips, each spanning 2.67 s (32 frames sampled at 12 Hz at a resolution of 512 × 288), with a temporal stride of 1.33 s.
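To make the windowing concrete, the following minimal sketch outlines the clip-level feature extraction; `decode_clip` and `backbone` are hypothetical stand-ins for the video decoder and the fine-tuned SlowFast model, and the sketch is not the exact extraction code.

```python
import torch

# Clip geometry from the text: 32 frames sampled at 12 Hz (about 2.67 s per
# clip), extracted with a temporal stride of 1.33 s so consecutive clips overlap.
NUM_FRAMES, SAMPLE_RATE_HZ = 32, 12
CLIP_SECONDS, STRIDE_SECONDS = NUM_FRAMES / SAMPLE_RATE_HZ, 1.33

def extract_clip_features(decode_clip, duration_s, backbone):
    """Slide a fixed-length window over the video and collect one feature
    vector per clip. `decode_clip(start_s)` is assumed to return a
    (1, C, T, H, W) tensor of 32 frames at 12 Hz at a resolution of 512x288;
    `backbone` stands in for the fine-tuned SlowFast feature extractor."""
    feats, t = [], 0.0
    backbone.eval()
    with torch.no_grad():
        while t + CLIP_SECONDS <= duration_s:
            clip = decode_clip(t)                  # frames from [t, t + 2.67 s)
            feats.append(backbone(clip).squeeze(0))
            t += STRIDE_SECONDS
    return torch.stack(feats)                      # (num_clips, feature_dim)
```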
To detect activities in the video, ActionFormer, a recent method previously developed by our group37, was adapted. ActionFormer leverages a Transformer-based model38, offers an open-source tool for temporal activity localization, and has demonstrated state-of-the-art results on major benchmarks37. ActionFormer is appropriate for detecting NICU care manipulation activities for two main reasons. First, it has been widely adopted to analyze a variety of human activities, ranging from sports and activities of daily living37 to nursing procedures39, attesting to its versatility and effectiveness. Second, it incorporates a local self-attention mechanism and thus supports the efficient processing of hour-long videos, which are common in the NICU. To further justify this choice, Supplementary Table 4 compares the results of ActionFormer on NICU videos with those of recent methods designed for temporal activity localization. Technically, the clip-level video features were input to ActionFormer, in which every moment of the video was examined and classified as either one of the care manipulation activities or background. If a moment was recognized as a manipulation activity, the temporal boundary of the activity, including its onset and offset, was further regressed by the model. The output was a set of detected manipulation activities, each with a label, a temporal onset and offset, and a confidence score. The model, as detailed in Supplementary Table 5, was trained on human-annotated activities in the training set using the AdamW optimization method40 for 60 epochs (learning rate of 2e−4 and batch size of 8). Additionally, Supplementary Table 6 presents an ablation study of the model design.
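For illustration, the detector's output can be represented as records of the following form; the schema and the confidence threshold are assumptions made for exposition and do not reflect ActionFormer's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected care manipulation activity. This schema is illustrative;
    the field names are not ActionFormer's actual output format."""
    label: str       # 'diaper change', 'feeding', or 'patting'
    onset_s: float   # regressed start time, in seconds
    offset_s: float  # regressed end time, in seconds
    score: float     # confidence score in [0, 1]

def keep_confident(detections, thresh=0.5):
    """Retain detections above a confidence threshold; the value 0.5 is an
    assumed example, not a threshold reported in the study."""
    return [d for d in detections if d.score >= thresh]
```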
Integration of physiological signals
To quantify the physiological responses to care manipulation activities, the heart rates and SpO2 levels accompanying the videos were integrated with the video-based activity detection results. The videos and physiological signals were synchronized during recording by the NEO device. Videos without synchronized physiological signals, and events lacking physiological readings (which can occur when the temporal resolution of the signal is coarse relative to the duration of the event), were excluded from the analysis of physiological responses. Based on the temporal boundary of a detected or human-annotated activity, the physiological signals were averaged prior to (as the baseline), during, and subsequent to the activity.
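A minimal sketch of this averaging step is given below, assuming the signal is a list of timestamped readings; the length of the pre- and post-activity windows (`window_s`) is an assumed parameter, not a value reported here.

```python
import numpy as np

def physiological_response(timestamps_s, values, onset_s, offset_s, window_s=60.0):
    """Average a physiological signal (e.g., heart rate or SpO2) prior to,
    during, and subsequent to an activity with the given temporal boundary.
    `window_s` is an assumed pre/post window length. A window with no
    readings yields None."""
    timestamps_s = np.asarray(timestamps_s)
    values = np.asarray(values, dtype=float)

    def mean_in(lo, hi):
        mask = (timestamps_s >= lo) & (timestamps_s < hi)
        return float(values[mask].mean()) if mask.any() else None

    baseline = mean_in(onset_s - window_s, onset_s)    # prior to the activity
    during   = mean_in(onset_s, offset_s)              # during the activity
    after    = mean_in(offset_s, offset_s + window_s)  # subsequent to the activity
    return baseline, during, after
```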
Quantification of NISS scores
The NISS12 considers a wide array of common procedures in NICUs. As a first step, our study explores the quantification of cumulative stress associated with diaper change, regarded as a moderately stressful procedure. The corrected gestational age (CGA) of each neonate was collected and assumed known at the time of each video recording. Videos with CGA > 37 weeks were excluded from the analysis, as NISS scores are not defined for this age group. A total of 258 videos were analyzed, for which two sets of NISS scores for diaper change were computed, one using the algorithm-predicted frequency and one using the human-annotated frequency, following the NISS12. The two sets of scores were then compared.
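As an illustration, assuming the NISS accumulates a per-procedure severity rating multiplied by the observed frequency of that procedure, the per-video score for diaper change could be computed as follows; the severity value shown is a placeholder, and the actual rating must be taken from the NISS instrument12.

```python
# Hypothetical severity rating for diaper change; the actual value must be
# taken from the NISS instrument itself and is not reported here.
DIAPER_CHANGE_SEVERITY = 3

def niss_score(event_count, severity=DIAPER_CHANGE_SEVERITY):
    """Cumulative stress contribution of one procedure type for one video,
    assuming the NISS accumulates severity x frequency. `event_count` is the
    diaper-change frequency, from either algorithm predictions or annotations."""
    return severity * event_count

# The two sets of scores compared in the study would then be, per video:
predicted_score = niss_score(event_count=4)  # frequency from detected activities
annotated_score = niss_score(event_count=5)  # frequency from human annotations
```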
Evaluation protocol and statistical analysis
For a fair evaluation of our method, 5-fold cross-validation was adopted. Stratified sampling, accounting for the number of samples in each activity category, was performed to split the dataset into five non-overlapping folds, each with an approximately equal number of activities. Details of the dataset splits are described in Supplementary Table 7. For each fold, our method was trained on the remaining 4 folds and evaluated on the current fold. The results were then aggregated across all 5 test splits. Different evaluation protocols, encompassing multiple metrics, were considered for the temporal localization of care manipulation activities and for the quantification of the duration, frequency, and physiological responses of these activities.
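A simplified sketch of such a stratified split is shown below; it greedily balances per-video event counts across folds and is an approximation of the idea, not the exact splitting procedure used in the study.

```python
import numpy as np

def stratified_video_folds(activity_counts, n_folds=5, seed=0):
    """Greedily assign videos to folds so that activity totals stay balanced.
    `activity_counts` is a (num_videos, num_categories) array of per-video
    event counts; returns one fold index per video."""
    counts = np.asarray(activity_counts, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(counts))  # random tie-breaking
    # Place event-heavy videos first so they spread evenly across folds.
    order = order[np.argsort(-counts[order].sum(axis=1), kind="stable")]
    fold_totals = np.zeros((n_folds, counts.shape[1]))
    assignment = np.empty(len(counts), dtype=int)
    for idx in order:
        # Pick the fold that remains most balanced after adding this video.
        fold = int(np.argmin((fold_totals + counts[idx]).sum(axis=1)))
        assignment[idx] = fold
        fold_totals[fold] += counts[idx]
    return assignment
```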
For activity detection, a temporal intersection-over-union (tIoU) threshold was used when matching predicted activities to human-annotated ones, following standard practice41,42. The tIoU between a predicted activity and a human-annotated activity was computed as the intersection of the two events divided by their union. A match between a predicted activity and a human-annotated activity was declared if the two events shared the same label and their tIoU was larger than the threshold. A prediction was counted as a true positive (correct prediction) if it could be matched to one of the annotated activities and the corresponding annotation had not already been assigned to another prediction; otherwise, the prediction was counted as a false positive (incorrect prediction). An annotated event was regarded as a false negative (missed event) if no prediction could be matched to it. Per-category precision-recall curves were subsequently calculated from the true positives, false positives, and false negatives. These precision-recall curves fully characterize the performance of activity detection, and a comprehensive set of metrics derived from them was evaluated: precision and recall were summarized at the operating point with the highest F1 score on each precision-recall curve and aggregated across all splits, and per-category average precision, computed as the area under the precision-recall curve, was also reported.
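The matching protocol can be summarized in the following sketch, assuming the predictions and annotations all belong to a single activity category; it mirrors the description above rather than the exact evaluation code.

```python
def tiou(pred, gt):
    """Temporal IoU of two (onset, offset) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, thresh=0.5):
    """Greedily match predictions (highest confidence first) to annotations;
    returns (true positives, false positives, false negatives). `preds` are
    (onset, offset, score) triples and `gts` are (onset, offset) pairs, all
    of one activity category; `thresh` is the tIoU threshold."""
    preds = sorted(preds, key=lambda p: -p[2])
    matched = [False] * len(gts)
    tp = fp = 0
    for onset, offset, _ in preds:
        best, best_iou = -1, thresh
        for i, g in enumerate(gts):
            iou = tiou((onset, offset), g)
            if not matched[i] and iou > best_iou:
                best, best_iou = i, iou
        if best >= 0:
            matched[best] = True
            tp += 1                    # matched prediction: true positive
        else:
            fp += 1                    # unmatched prediction: false positive
    fn = matched.count(False)          # unmatched annotation: missed event
    return tp, fp, fn
```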
For the quantification of duration, frequency, physiological responses, and NISS scores, the target variables (e.g., the frequency of diaper change activities) were computed for each video using either algorithmic predictions or human annotations. Variants of paired t-tests were then conducted to compare the results from the two groups.
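For illustration, one such comparison could be carried out with SciPy's paired t-test (`scipy.stats.ttest_rel`); the values below are made-up example data, not results from the study.

```python
from scipy import stats

# Paired comparison of a per-video quantity (e.g., diaper-change frequency)
# computed from algorithm predictions vs. human annotations.
predicted = [3, 2, 4, 1, 2]   # illustrative per-video predicted frequencies
annotated = [3, 2, 3, 1, 2]   # the same quantity from human annotations
t_stat, p_value = stats.ttest_rel(predicted, annotated)
```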
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.