
as fast, privacy-preserving baselines, while the 3D-CNN consumes short clips of raw frames to capture appearance and motion cues that are not visible in pose alone (Bai et al., 2018; Tran et al., 2015). The MS-TCN builds on these ideas by predicting intake at every frame rather than at the window level, providing a denser and more expressive temporal model for evaluating intake timelines and pace.

For the window-based models (baseline TCN and RGB 3D-CNN), we operate on the fixed-length temporal windows constructed as described in Section 2. Each window is assigned a binary label, intake or background, based on its overlap with annotated intake events. To avoid leakage, all windows from a given eating session are assigned exclusively to the training, validation, or test split. This windowed representation enables the models to learn short, localized temporal patterns associated with intake while operating on uniformly sized inputs.

The pose-based branch uses a strong TCN backbone. For each window, we load the full sequence of 2D joint coordinates from the EatSense pose files, subsample or interpolate it to a fixed number of timesteps, and standardize the features with a per-window z-score. These sequences have shape (T, F), where T is the number of timesteps and F is the number of pose features. We propose PoseTCNPro, a pose-only Temporal Convolutional Network that operates on fixed-length temporal windows and serves as our primary pose-based model. PoseTCNPro is designed to capture short- to medium-range temporal dependencies in wrist and arm motion while remaining lightweight enough for efficient inference.
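The per-window pose preprocessing described above can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the target length of 64 timesteps, the use of linear interpolation for resampling, and the epsilon guard are illustrative assumptions, and reading the EatSense pose files themselves is not shown.

```python
import numpy as np

def preprocess_pose_window(seq: np.ndarray, target_len: int = 64) -> np.ndarray:
    """Resample a (T_raw, F) pose sequence to (target_len, F) and z-score it.

    Implements the two steps described in the text: (1) subsample or
    interpolate the window to a fixed number of timesteps, and (2) apply a
    per-window z-score over each feature.
    """
    t_raw, n_feats = seq.shape
    # Map the raw timesteps onto a uniform grid of target_len timesteps.
    # np.interp downsamples or upsamples depending on t_raw vs. target_len.
    src = np.linspace(0.0, 1.0, t_raw)
    dst = np.linspace(0.0, 1.0, target_len)
    resampled = np.stack(
        [np.interp(dst, src, seq[:, j]) for j in range(n_feats)], axis=1
    )
    # Per-window z-score; the small epsilon guards constant features.
    mean = resampled.mean(axis=0, keepdims=True)
    std = resampled.std(axis=0, keepdims=True) + 1e-8
    return (resampled - mean) / std
```

The resulting (target_len, F) array is what a window-based model such as PoseTCNPro would consume; because the statistics are computed per window, each input is normalized independently of the session it came from.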

The model first applies a 1D convolution to project the F-dimensional pose feature vector into a base channel width, followed by a stack of dilated temporal convolutional blocks with depth-wise separable convolutions, residual connections, and squeeze-and-excitation (SE) attention over channels. Dilation factors increase across blocks (e.g., 1, 2, 4, 8, 16), allowing the network to achieve a large temporal receptive field while keeping the parameter count modest. An attention-based pooling layer aggregates the temporal dimension into a single feature vector, which is passed through a small fully connected head to produce a single logit for binary intake prediction. A second implementation of this architecture in Keras is tuned with Hyperband, exploring a range of base widths, dropout rates, and learning rates while keeping the overall structure fixed.

For the RGB baseline, we use a compact residual 3D convolutional neural network, which we refer to as VideoResNet3D; this model serves as the RGB 3D-CNN in our comparison. For each temporal window, we load the corresponding frames from disk, resize them to a square resolution, and uniformly sample a fixed-length clip (e.g., 16 frames) across the window interval. This produces an input tensor of shape (3, T, H, W). The VideoResNet3D architecture begins with a 3D convolutional “stem” and then applies several stages of residual 3D blocks with spatial and temporal down-sampling. Each block includes two 3D convolutions, batch normalization, GELU activations, dropout, and an SE module that re-weights channels based on global spatiotemporal context. After the final stage, we apply an attention pooling layer over time (averaging over the spatial dimensions) so
