docs: track paper sources and scoliosis experiment updates
@@ -0,0 +1,340 @@
% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encodings may result in incorrect characters.
%
\usepackage{graphicx}
\usepackage{multirow}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage[font=small,labelfont=bf]{caption}
\usepackage{url}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}
%\urlstyle{rm}
%
\begin{document}
%

\title{Gait Patterns as Biomarkers: A Video-Based Approach for Classifying Scoliosis}
\titlerunning{A Video-Based Approach for Classifying Scoliosis}

%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%

\authorrunning{Z. Zhou et al.}
\author{
Zirui Zhou\inst{1}$^{\dagger}$ % index{Zirui, Zhou}
\and Junhao Liang\inst{1}$^{\dagger}$ % index{Junhao, Liang}
\and Zizhao Peng\inst{1,2} % index{Zizhao, Peng}
\and Chao Fan\inst{1} % index{Chao, Fan}
\and \\ Fengwei An\inst{1} % index{Fengwei, An}
\and Shiqi Yu\inst{1}$^{*}$ % index{Shiqi, Yu}
}
% index{Last Name, First Name}
\institute{Southern University of Science and Technology, Shenzhen, China
\and
The Hong Kong Polytechnic University, Hong Kong, China
%\email{yusq@sustech.edu.cn}
}

%\authorrunning{F. Author et al.}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
% \institute{Princeton University, Princeton NJ 08544, USA \and
% Springer Heidelberg, Tiergartenstr. 17, 69121 Heidelberg, Germany
% \email{lncs@springer.com}\\
% \url{http://www.springer.com/gp/computer-science/lncs} \and
% ABC Institute, Rupert-Karls-University Heidelberg, Heidelberg, Germany\\
% \email{\{abc,lncs\}@uni-heidelberg.de}}
%
\maketitle % typeset the header of the contribution
\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}
%
\begin{abstract}
Scoliosis presents significant diagnostic challenges, particularly in adolescents, for whom early detection is crucial for effective treatment. Traditional diagnostic and follow-up methods, which rely on physical examinations and radiography, are limited by the need for clinical expertise and the risk of radiation exposure, restricting their use for widespread early screening. In response, we introduce a novel video-based, non-invasive method for scoliosis classification using gait analysis, effectively circumventing these limitations. This study presents Scoliosis1K, the first large-scale dataset specifically designed for video-based scoliosis classification, encompassing over one thousand adolescents. Leveraging this dataset, we developed ScoNet, an initial model that struggled with the complexities of real-world data. This led to ScoNet-MT, an enhanced model incorporating multi-task learning, which demonstrates promising diagnostic accuracy for practical applications. Our findings show that gait can serve as a non-invasive biomarker for scoliosis, transforming screening practices through deep learning and setting a precedent for non-invasive diagnostic methodologies. The dataset and code are publicly available at \url{https://zhouzi180.github.io/Scoliosis1K/}.

\blfootnote{$^{\dagger}$ Equal contribution. \\
$^{*}$ Corresponding author: yusq@sustech.edu.cn.
}

\keywords{Scoliosis \and Gait analysis \and Non-invasive screening \and Deep learning \and Computer vision.}
\end{abstract}
\begin{figure*}[ht]
\centering
\includegraphics[width=0.9\textwidth]{image/fig_intro2.pdf}
\caption{Comparative overview of scoliosis diagnosis methods: (a) traditional X-ray examination, the clinical gold standard~\cite{Thuaimer2024}; (b) non-invasive analysis of bareback photos~\cite{zhang2023deep}; (c) our proposed gait analysis approach, which enables efficient, large-scale early adolescent screening to identify cases requiring further radiographic investigation, highlighting its non-invasive and privacy-preserving characteristics.}
\label{fig:intro}
\end{figure*}
\section{Introduction}\label{sec:introduction}
Scoliosis, a complex spinal deformity characterized by three-dimensional curvature, significantly impacts adolescents' physical well-being and quality of life globally. Clinically, scoliosis is assessed by measuring the Cobb angle, defined as the angle between the upper and lower end vertebrae on standing X-rays (Figure~\ref{fig:intro}(a)), with an angle exceeding $10^\circ$ indicating scoliosis~\cite{weinstein2008adolescent,reamy2001adolescent}. The early-stage condition is often asymptomatic and can lead to severe health issues if undiagnosed~\cite{payne1997does}. In China, scoliosis affects approximately 5.14\% of school-aged children~\cite{hengwei2016prevalence}, highlighting the critical need for effective early screening methods.

Traditionally, scoliosis diagnosis and monitoring have relied on physical exams and radiography, which require significant clinical expertise and expose patients to radiation, limiting early, widespread screening. Advances in deep learning have prompted the exploration of non-invasive scoliosis assessment methods, such as bareback photo analysis~\cite{yang2019development,zhang2023deep}. However, these alternatives often raise concerns about privacy and efficiency. To address these challenges, we introduce a novel video-based method using gait as a biomarker for scoliosis, eliminating the need for direct bodily exposure (Figure~\ref{fig:intro}(b) vs.\ (c)). Despite its potential, no prior work has benchmarked this promising direction, primarily due to the absence of public datasets and baseline models.

In response, we developed Scoliosis1K, a dataset featuring over 1,000 adolescents and 400,000 frames, setting a new standard for video-based adolescent scoliosis screening. To respect privacy and enhance usability, we opted for silhouette data and built on existing gait recognition techniques~\cite{gaitset,opengait,gaitpart,liang2022gaitedge,fan2023learning} to create ScoNet, demonstrating the viability of gait as a scoliosis biomarker. ScoNet's initial vulnerability to sample imbalance then motivated ScoNet-MT, an enhanced multi-task learning model.

This work contributes to the field by (1) creating Scoliosis1K, the first large-scale dataset for video-based scoliosis classification, establishing a new benchmark for the research community; (2) introducing ScoNet, the first baseline model for scoliosis classification through gait analysis, and evolving it into ScoNet-MT to better handle real-world data complexities; and (3) demonstrating that ScoNet-MT achieves promising diagnostic accuracy for practical applications, underscoring the potential of gait as a non-invasive biomarker for scoliosis and the transformative impact of deep learning in healthcare diagnostics.
\begin{figure*}[!t]
\centering
\includegraphics[width=0.8\textwidth]{image/fig_dataset.pdf}
\caption{Silhouettes from the Scoliosis1K dataset: (a) positive, (b) neutral, and (c) negative samples.}
\label{fig:dataset}
\end{figure*}
\section{Dataset}\label{sec:dataset}
\subsection{Overview of Scoliosis1K}
To our knowledge, Scoliosis1K is the first large-scale dataset specifically designed for video-based scoliosis classification. The samples are categorized into three diagnostic groups: positive (Cobb angle $>10^\circ$), neutral (Cobb angle $\approx 10^\circ$), and negative (Cobb angle $<10^\circ$). The dataset comprises 1,050 adolescent participants from a middle school in China and includes 447,900 silhouette images from 1,493 video sequences. Figure~\ref{fig:dataset} shows three sequences representing the different categories. The study received approval from the relevant ethics committee, with informed consent obtained from all participants and their guardians.
\subsection{Data Collection and Preprocessing}
The videos were captured at 720p resolution with a camera positioned 1.4 to 4.2 meters from the participants, who were instructed to walk along a corridor. Data collection was designed to simulate a controlled yet natural walking environment conducive to accurate biomechanical analysis. Each sequence, containing approximately 300 frames at 15 frames per second, was annotated by scoliosis experts using well-established screening methods, including visual assessment and the Adam's forward bend test. Experts did not classify participants by watching the videos; instead, they examined the participants directly, and all videos from a participant were then labeled according to the assessed category.

Due to the relative scarcity of positive scoliosis cases, diagnosed individuals were encouraged to contribute multiple sequences. This helped mitigate potential class imbalance, enhancing the dataset's analytical reliability and robustness.

During preprocessing, raw video footage was converted into binary silhouette sequences. Choosing silhouettes serves two purposes: anonymizing participant data to safeguard privacy, and focusing the deep learning model on the body region rather than the background. These silhouettes retain critical gait features while discarding irrelevant background information. A detailed description of silhouette extraction is provided in Section~\ref{sec:preprocessing}.
\begin{table*}[t]
\caption{Statistics of the Scoliosis1K dataset, highlighting diversity in diagnostic categories and participant demographics.}
\centering
\resizebox{\linewidth}{!}{
\begin{tabular}{l|l|cccc}
\hline
 & Attributes & All & Positive & Neutral & Negative \\ \hline
\multirow{6}{*}{Scoliosis1K} & Number of Participants & 1050 & 176 & 82 & 792 \\
 & Number of Sequences & 1493 & 493 & 200 & 800 \\
 & Sex (F/M) & 641/409 & 113/63 & 49/33 & 479/313 \\
 & Age (years, mean $\pm$ std) & 15.2 $\pm$ 1.5 & 14.3 $\pm$ 1.0 & 14.0 $\pm$ 0.6 & 15.5 $\pm$ 1.5 \\
 & Height (cm, mean $\pm$ std) & 163.2 $\pm$ 8.8 & 161.6 $\pm$ 7.1 & 161.4 $\pm$ 6.7 & 163.7 $\pm$ 9.3 \\
 & Weight (kg, mean $\pm$ std) & 51.9 $\pm$ 10.7 & 48.3 $\pm$ 8.4 & 46.7 $\pm$ 7.8 & 53.3 $\pm$ 11.1 \\ \hline
\end{tabular}
}
\label{tab:dataset}
\end{table*}
%\vspace{-8mm}
\subsection{Demographics and Dataset Characteristics}

Table~\ref{tab:dataset} provides a demographic and clinical overview of Scoliosis1K, illustrating its comprehensiveness for adolescent scoliosis screening. The dataset's design emphasizes scale and diversity, encompassing a wide range of participant attributes and gait patterns. This variety strengthens the dataset's value for training deep models that classify scoliosis from gait.
\subsection{Implications for Scoliosis Research}
Scoliosis1K can advance scoliosis diagnosis research in the following aspects:

\begin{itemize}
\item \textbf{Scale and Scope:} To our knowledge, it is the first large-scale dataset for automatic scoliosis diagnosis, making computer vision-based scoliosis diagnosis feasible.
\item \textbf{Innovation in Non-Invasive Screening:} Scoliosis1K provides high-quality, annotated silhouette data, addressing the critical need for non-invasive diagnostic tools in scoliosis screening. This fosters innovation by enabling the exploration of methods that prioritize patient safety and privacy.
\end{itemize}

Furthermore, Scoliosis1K bridges a critical gap in the availability of high-quality, annotated images for non-invasive scoliosis screening. This contribution not only catalyzes innovation in healthcare technology but also lays the groundwork for future research in automated scoliosis diagnosis, paving the way for improved public health, particularly in regions with limited medical services.
\begin{figure*}[ht]
\centering
\includegraphics[width=\textwidth]{image/fig_pipeline.pdf}
\caption{\textbf{The Proposed Pipeline:} The participant is tracked throughout the video, excluding non-participant entities such as clinicians. The participant's silhouette is then segmented, followed by scoliosis classification using ScoNet-MT based on gait analysis.}
\label{fig:pipeline}
\end{figure*}
%\vspace{-10mm}
\section{Methodology}\label{sec:method}
This section presents our approach to scoliosis classification through gait pattern analysis in videos. The method involves three key stages: participant tracking, segmentation, and gait-based scoliosis classification, as shown in Figure~\ref{fig:pipeline}.
\subsection{Participant Tracking and Segmentation}\label{sec:preprocessing}
We used BYTETracker~\cite{bytetrack} for person tracking in videos. BYTETracker improves tracking accuracy by evaluating every detection box, recovering occluded objects, and discarding irrelevant false detections. It employs tracklet similarity to differentiate the participant from the background, ensuring consistent tracking.

After tracking, the cropped image region is fed into PP-HumanSeg~\cite{yuan2020object,liu2021paddleseg}, a robust deep model for precise human segmentation. PP-HumanSeg combines an encoder, a Spatial Pyramid Pooling Module (SPPM), and a Flexible Lightweight Decoder (FLD) to generate binary masks, i.e., silhouettes. We then normalized the silhouettes to a fixed size, following standard practice in gait recognition~\cite{iwama2012isir}.
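The size normalization mentioned above typically crops the silhouette to the body, centers it horizontally, and rescales it to a fixed resolution (our experiments use $64 \times 44$ inputs). The following is a minimal NumPy sketch of such an alignment; the function name and the exact centering rule are illustrative assumptions, not the released preprocessing code.

```python
import numpy as np

def normalize_silhouette(mask: np.ndarray, out_h: int = 64, out_w: int = 44) -> np.ndarray:
    """Align a binary silhouette the way common gait pipelines do:
    crop to the person's vertical extent, center on the horizontal
    center of mass, and resize to a fixed (out_h, out_w) grid."""
    ys, _ = np.nonzero(mask)
    if ys.size == 0:
        return np.zeros((out_h, out_w), dtype=mask.dtype)
    top, bottom = ys.min(), ys.max() + 1
    body = mask[top:bottom]                        # crop vertically to the body
    h = bottom - top
    cx = int(round(np.nonzero(body)[1].mean()))    # horizontal center of mass
    half_w = int(round(h * out_w / out_h / 2))     # keep the target aspect ratio
    padded = np.pad(body, ((0, 0), (half_w, half_w)))
    window = padded[:, cx : cx + 2 * half_w]       # centered crop (offset by pad)
    # nearest-neighbor resize to (out_h, out_w), avoiding extra dependencies
    ri = np.arange(out_h) * window.shape[0] // out_h
    ci = np.arange(out_w) * window.shape[1] // out_w
    return window[ri][:, ci]
```

In a full pipeline the same transform is applied to every frame of a sequence, so the walker stays centered across the clip.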
\subsection{Scoliosis Classification Based on Gait}
Recognizing gait as a non-invasive biomarker for scoliosis, we introduce ScoNet and its enhanced variant, ScoNet-MT, designed to automate the classification process with high efficiency.

\textbf{ScoNet} employs a ResNet-inspired encoder $\mathcal{E}$ to convert participant silhouettes $\mathbf{s}$ into 3D feature maps $\mathbf{f}$:
\[\mathbf{f}=\mathcal{E}(\mathbf{s}) \in \mathbb{R}^{n\times c \times h \times w},\]
where $n$, $c$, $h$, and $w$ denote the number of gait frames, channels, height, and width, respectively. Temporal Pooling (TP)~\cite{gaitset} condenses these into a frame-level aggregate $z$:
\[z=\mathrm{TP}(\mathbf{f}) \in \mathbb{R}^{c \times h \times w}.\]
Horizontal Pooling (HP)~\cite{fu2019horizontal} then splits $z$ into 16 horizontal segments $z_{s}$ and pools each segment globally for comprehensive feature extraction:
\[\mathbf{f}^{\prime} = \text{maxpool}(z_{s}) + \text{avgpool}(z_{s}).\]
A fully connected layer maps these vectors into the metric space, with BNNeck~\cite{bnneck} refining the feature space before final classification using the cross-entropy loss:
\[L_{ce} = -\sum_{i} y_i \log(\widehat{y}_i),\]
where $y_i$ and $\widehat{y}_i$ denote the ground-truth and predicted probabilities for class $i$.
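To make the TP and HP steps concrete, here is a minimal NumPy sketch (an illustrative re-implementation, not the released ScoNet/OpenGait code). The max-over-time choice for TP follows the GaitSet-style set pooling; function names are ours, and the encoder, fully connected layer, and BNNeck are omitted.

```python
import numpy as np

def temporal_pooling(f: np.ndarray) -> np.ndarray:
    """TP: collapse the frame axis of an (n, c, h, w) feature map with a
    max over time (GaitSet-style set pooling)."""
    return f.max(axis=0)                            # -> (c, h, w)

def horizontal_pooling(z: np.ndarray, parts: int = 16) -> np.ndarray:
    """HP: split the (c, h, w) map into `parts` horizontal strips (h must be
    divisible by `parts`) and pool each strip with maxpool + avgpool."""
    c, h, w = z.shape
    strips = z.reshape(c, parts, h // parts, w)     # strip the height axis
    return (strips.max(axis=(2, 3)) + strips.mean(axis=(2, 3))).T  # (parts, c)

rng = np.random.default_rng(0)
f = rng.random((30, 256, 16, 11))   # n=30 frames, c=256 channels, 16x11 maps
feat = horizontal_pooling(temporal_pooling(f))
print(feat.shape)                   # (16, 256): one vector per strip
```

Each of the 16 strip vectors is then projected by the fully connected layer before classification.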
\textbf{ScoNet-MT} builds on ScoNet by incorporating multi-task learning with a Gait Recognition (GR) task that highlights distinct human motion patterns, reducing bias from non-gait factors. The model uses a triplet loss to distinguish the subtle gait variations critical for scoliosis classification. In each training batch, $N$ triplets are formed, each comprising an anchor sequence $\mathbf{s}_i^a$, a positive sequence $\mathbf{s}_i^p$ of the same identity, and a negative sequence $\mathbf{s}_i^n$ of a different identity:
\[
L_{triplet} = \sum_{i=1}^{N} \max\left(0, \|f(\mathbf{s}_i^a) - f(\mathbf{s}_i^p)\|_2^2 - \|f(\mathbf{s}_i^a) - f(\mathbf{s}_i^n)\|_2^2 + \alpha\right),
\]
where $f(\cdot)$ is the embedding function mapping each sequence into the embedding space, $\|\cdot\|_2$ is the Euclidean norm, and $\alpha$ is a margin enforcing separation between positive and negative matches. This setup enhances the model's capacity to discriminate between classes. The total loss, combining the cross-entropy and triplet losses, optimizes ScoNet-MT for accurate scoliosis classification:
\[
L_{total} = L_{ce} + L_{triplet}.
\]
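The two objectives can be sketched directly from their formulas. This is an illustrative NumPy version (not the authors' code); the default margin of 0.2 matches the value reported in our implementation details.

```python
import numpy as np

def triplet_loss(fa: np.ndarray, fp: np.ndarray, fn: np.ndarray,
                 margin: float = 0.2) -> float:
    """L_triplet = sum_i max(0, ||fa_i - fp_i||^2 - ||fa_i - fn_i||^2 + margin),
    with rows of fa/fp/fn holding anchor/positive/negative embeddings."""
    d_ap = ((fa - fp) ** 2).sum(axis=1)     # squared anchor-positive distances
    d_an = ((fa - fn) ** 2).sum(axis=1)     # squared anchor-negative distances
    return float(np.maximum(0.0, d_ap - d_an + margin).sum())

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """L_ce = -sum_i y_i log(y_hat_i), with y_hat from a softmax over logits."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    p /= p.sum(axis=1, keepdims=True)
    return float(-np.log(p[np.arange(len(labels)), labels]).sum())

# The combined objective is an unweighted sum: L_total = L_ce + L_triplet.
```

When the negative already sits far beyond the margin, the triplet term vanishes and only the classification loss drives the update.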
\section{Experiments}\label{sec:experiments}
\subsection{Setup}
\textbf{Evaluation Protocol.} The dataset was divided into a training set of 745 sequences and a test set of 748 sequences, maintaining a realistic positive:neutral:negative ratio of 1:1:8 in the training set; specifically, the sequence counts for the three classes are 74, 74, and 596. Model performance was evaluated using three metrics: accuracy, sensitivity, and specificity, defined as follows:
\begin{itemize}
\item \textbf{Accuracy:} The proportion of correctly classified samples out of the total number of samples.
\item \textbf{Sensitivity:} The proportion of true positives (actual scoliosis cases correctly classified) out of all positive cases.
\item \textbf{Specificity:} The proportion of true negatives (actual normal cases correctly classified) out of all negative cases.
\end{itemize}
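These three metrics follow directly from the screening counts. A small sketch (our illustration, treating the decision as scoliosis-vs-normal for simplicity):

```python
def screening_metrics(tp: int, fn: int, tn: int, fp: int):
    """Accuracy, sensitivity, and specificity from binary screening counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)    # recall on actual scoliosis cases
    specificity = tn / (tn + fp)    # recall on actual normal cases
    return accuracy, sensitivity, specificity

print(screening_metrics(tp=90, fn=10, tn=80, fp=20))  # (0.85, 0.9, 0.8)
```

For screening, sensitivity is the critical number: a missed case (false negative) costs far more than a referral that radiography later rules out.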
\noindent
\textbf{Implementation Details.} ScoNet and ScoNet-MT were implemented with PyTorch~\cite{paszke2019pytorch} and OpenGait~\cite{opengait}, using input images of size $64 \times 44$. Training used a triplet loss with a margin of 0.2; the positive sequence in each triplet was drawn from the same gait sequence as the anchor but consisted of different frames. For all experiments, 30 frames were sampled from each gait sequence as input. All models were trained with an SGD optimizer~\cite{ruder2016overview} with an initial learning rate of 0.1 and a weight decay of 0.0005. The learning rate was reduced by a factor of 10 at 10,000, 14,000, and 18,000 iterations, with training continuing for 20,000 iterations.
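The learning-rate schedule just described is a step decay at fixed milestones. A minimal sketch of the stated schedule (our illustration; the actual training uses the framework's built-in scheduler):

```python
def learning_rate(iteration: int, base_lr: float = 0.1,
                  milestones: tuple = (10_000, 14_000, 18_000),
                  gamma: float = 0.1) -> float:
    """Step schedule: start at base_lr and multiply by gamma (here 1/10)
    at each milestone; training runs for 20,000 iterations in total."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** drops
```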
\begin{table}[!t]
\begin{minipage}[!t]{.60\textwidth}%
\caption{Comparison of our method with conventional scoliosis screening techniques. $^*$ indicates results directly cited from~\cite{karachalios1999ten}. The best results are highlighted in bold.}
\label{tab:method}
\centering
\begin{tabular}{c|ccc}
\hline
Method & Accuracy & Sensitivity & Specificity \\ \hline
Adams Test$^*$~\cite{karachalios1999ten} & - & 84.4\% & \textbf{95.2\%} \\
Scoliometer$^*$~\cite{karachalios1999ten} & - & 90.6\% & 79.8\% \\
ScoNet (Ours) & 51.3\% & \textbf{100.0\%} & 33.2\% \\
ScoNet-MT (Ours) & \textbf{82.0\%} & 99.0\% & 76.5\% \\ \hline
\end{tabular}
\hrule height 0pt
\end{minipage}%
~~
\begin{minipage}[!t]{.38\textwidth}
\centering
\includegraphics[width=0.8\textwidth]{image/fig_cm.pdf}
\captionsetup{type=figure}
\caption{Confusion matrix for ScoNet-MT.}
\label{fig:cm}
\end{minipage}
\end{table}
%\vspace{-10mm}
\subsection{Results}
ScoNet-MT shows significant improvements in accuracy and specificity over the initial ScoNet model, as shown in Table~\ref{tab:method}: accuracy increased by 30.7 percentage points and specificity by 43.3 percentage points, highlighting the effectiveness of multi-task learning. ScoNet-MT's sensitivity surpasses traditional methods such as the \textit{Adam's forward bend test} and the \textit{Scoliometer}, while its slightly lower specificity suggests room for further enhancement. These findings highlight the potential of our approach to surpass human expertise in adolescent scoliosis screening. The confusion matrix (Figure~\ref{fig:cm}) offers detailed insight into the model's behavior, showing that it distinguishes negative cases well despite some overlap between the positive and neutral classes.

The heatmaps in Figure~\ref{fig:heatmap} show the regions of interest highlighted by our model, visualized using the technique from~\cite{heatmap2017}. While ScoNet mainly focuses on the extremities, ScoNet-MT, through multi-task learning, extends its attention to critical areas such as the head and shoulders, aligning with key scoliosis-related motion pattern indicators identified in the literature~\cite{mahaudens2009gait,kramers2004gait,zhu2021comparison,wen2022trunk}.
\begin{figure}[t]
\centering
\includegraphics[width=0.8\textwidth]{image/fig_heatmap.pdf}
\caption{Visualization results of our method.}
\label{fig:heatmap}
\end{figure}
\begin{table}[t]
\begin{minipage}[b]{.56\textwidth}%
\caption{Performance comparison of ScoNet-MT against baseline models and ScoNet. The best results are highlighted in bold.}
\label{tab:ablation}
\centering
\resizebox{\linewidth}{!}{\begin{tabular}{l|ccc}
\hline
Method & Accuracy & Sensitivity & Specificity \\ \hline
1) Baseline CNN & 45.5\% & 99.0\% & 27.0\% \\
2) Baseline CNN-Triplet & 56.3\% & 77.0\% & \textbf{85.0\%} \\
3) Baseline CNN-MT & 57.8\% & 96.1\% & 44.4\% \\
4) ScoNet & 51.3\% & \textbf{100.0\%} & 33.2\% \\
5) ScoNet-Triplet & 61.2\% & \textbf{100.0\%} & 46.6\% \\
ScoNet-MT (Ours) & \textbf{82.0\%} & 99.0\% & 76.5\% \\ \hline
\end{tabular}}
\hrule height 0pt
\end{minipage}%
~~
\begin{minipage}[b]{.38\textwidth}%
\caption{Test set accuracy of ScoNet and ScoNet-MT under various class distributions (positive, neutral, negative) during training.}
\label{tab:ratio}
\centering
\resizebox{0.9\linewidth}{!}{\begin{tabular}{c|cc}
\hline
Pos:Neu:Neg & ScoNet & ScoNet-MT \\ \hline
1:1:2 & 91.4\% & 95.2\% \\ \hline
1:1:4 & 88.6\% & 90.5\% \\ \hline
1:1:8 & 51.3\% & 82.0\% \\ \hline
1:1:16 & 23.7\% & 49.5\% \\ \hline
\end{tabular}}
\hrule height 0pt
\end{minipage}%
\end{table}
\subsection{Ablation Studies}

\textbf{Baseline Comparisons.} As this is the first video-based solution using a large-scale image dataset and a deep model, we could not find comparable methods in the literature. To demonstrate feasibility and effectiveness, we therefore compared ScoNet-MT against several baseline models:
1) \textit{Baseline CNN:} a basic model trained with cross-entropy loss, excluding horizontal pooling and BNNeck;
2) \textit{Baseline CNN-Triplet:} the Baseline CNN with an added triplet loss to enhance category differentiation;
3) \textit{Baseline CNN-MT:} the Baseline CNN extended with multi-task learning using identity information for better generalization;
4) \textit{ScoNet:} our initial model;
5) \textit{ScoNet-Triplet:} a variant of ScoNet that adds a triplet loss to improve feature discrimination.
Table~\ref{tab:ablation} presents the performance comparison, showing that ScoNet-MT significantly outperforms all baselines and highlighting the effectiveness of multi-task learning in enhancing diagnostic precision.
\noindent
\textbf{Class-Imbalanced Distribution.} Given the inherent imbalance in scoliosis case distribution, we evaluated ScoNet and ScoNet-MT under different class ratio settings in the training set. The sequence counts for these settings were 186:186:373 for approximately 1:1:2, 124:124:497 for 1:1:4, and 41:41:663 for 1:1:16, reflecting real-world scenarios where negative cases predominate. Table~\ref{tab:ratio} presents our findings: ScoNet-MT consistently outperforms ScoNet across all ratios. This robust performance, even under severe imbalance, underscores ScoNet-MT's adaptability and its ability to mitigate overfitting, highlighting its diagnostic accuracy in realistic settings.
\section{Conclusion}\label{sec:conclusion}
Our study demonstrates the effectiveness of gait as a non-invasive biomarker for scoliosis. The introduction of the Scoliosis1K dataset, together with the ScoNet and ScoNet-MT models, marks a significant advance in this field, enabling early and accurate scoliosis classification.

The implications of our work are far-reaching, offering a scalable and privacy-preserving diagnostic tool with the potential to transform adolescent scoliosis screening, especially in resource-limited regions. Future work will focus on expanding the diversity of our dataset, identifying additional biomarkers, and exploring more effective methods. A mature, large-scale scoliosis screening solution based on vision-based gait analysis could benefit a vast number of children, particularly in developing countries.
\begin{credits}
\subsubsection{\ackname}
This work was supported in part by the Shenzhen International Research Cooperation Project (Grant No. GJHZ20220913142611021) and ACCESS (AI Chip Center for Emerging Smart Systems), which is sponsored by InnoHK funding, Hong Kong.

\subsubsection{\discintname}
The authors have no competing interests to declare that are relevant to the content of this article.
\end{credits}

\bibliographystyle{splncs04}
\bibliography{refs}

\end{document}
@@ -0,0 +1,386 @@
% This is a modified version of Springer's LNCS template suitable for anonymized MICCAI 2025 main conference submissions.
% Original file: samplepaper.tex, a sample chapter demonstrating the LLNCS macro package for Springer Computer Science proceedings; Version 2.21 of 2022/01/12
% !TEX root = main.tex
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encodings may result in incorrect characters.
%
\usepackage{float} % add in the preamble
\usepackage{hyperref}
\usepackage{subcaption}
\usepackage{caption}
\usepackage{graphicx,verbatim}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage[table,xcdraw]{xcolor}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{times}
\usepackage{epsfig}
\let\proof\relax
\let\endproof\relax
\usepackage{amsthm}
\usepackage{mathrsfs}
\usepackage{threeparttable}
%
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
\usepackage{color}
\renewcommand\UrlFont{\color{blue}\rmfamily}
%\urlstyle{rm}
%
\begin{document}
%
\title{Pose as Clinical Prior: Learning Dual Representations for Scoliosis Screening}
\titlerunning{Learning Dual Representations for Scoliosis Screening}
%
% \begin{comment} %% Removed for anonymized MICCAI 2025 submission
\authorrunning{Z. Zhou et al.}
\author{
Zirui Zhou\inst{1} % index{Zirui, Zhou}
% \and Junhao Liang\inst{1}$^{\dagger}$ % index{Junhao, Liang}
\and Zizhao Peng\inst{1,2} % index{Zizhao, Peng}
\and Dongyang Jin\inst{1}
\and Chao Fan\inst{3} % index{Chao, Fan}
\and \\ Fengwei An\inst{1} % index{Fengwei, An}
\and Shiqi Yu\inst{1}$^{*}$ % index{Shiqi, Yu}
}
% index{Last Name, First Name}
\institute{Southern University of Science and Technology, Shenzhen, China
\and
The Hong Kong Polytechnic University, Hong Kong, China
\and
School of Artificial Intelligence, Shenzhen University, Shenzhen, China
%\email{yusq@sustech.edu.cn}
}
% \end{comment}

% \author{Anonymized Authors} %% Added for anonymized MICCAI 2025 submission
% \authorrunning{Anonymized Author et al.}
% \institute{Anonymized Affiliations \\
% \email{email@anonymized.com}}

\maketitle % typeset the header of the contribution
%
\begin{abstract}
Recent AI-based scoliosis screening methods primarily rely on large-scale silhouette datasets, often neglecting clinically relevant postural asymmetries, which are key indicators in traditional screening. In contrast, pose data provide an intuitive skeletal representation, enhancing clinical interpretability across various medical applications. However, pose-based scoliosis screening remains underexplored due to two main challenges: (1) the scarcity of large-scale, annotated pose datasets; and (2) the discrete and noise-sensitive nature of raw pose coordinates, which hinders the modeling of subtle asymmetries. To address these limitations, we introduce \textbf{Scoliosis1K-Pose}, a 2D human pose annotation set that extends the original Scoliosis1K dataset, comprising 447,900 frames of 2D keypoints from 1,050 adolescents. Building on this dataset, we introduce the \textbf{Dual Representation Framework (DRF)}, which integrates a continuous \textit{skeleton map} that preserves spatial structure with a discrete \textit{Postural Asymmetry Vector (PAV)} that encodes clinically relevant asymmetry descriptors. A novel \textit{PAV-Guided Attention (PGA)} module further uses the PAV as a clinical prior to direct feature extraction from the skeleton map, focusing on clinically meaningful asymmetries. Extensive experiments demonstrate that DRF achieves state-of-the-art performance. Visualizations further confirm that the model leverages clinical asymmetry cues to guide feature extraction and promote synergy between its dual representations. The dataset and code are publicly available at \url{https://zhouzi180.github.io/Scoliosis1K/}.

\keywords{Scoliosis Screening \and Biometrics in Healthcare \and Computer Vision.}
% Authors must provide keywords and are not allowed to remove this Keyword section.

\end{abstract}
%
%
\section{Introduction}
Scoliosis is a complex, three-dimensional spinal deformity that affects approximately 3.1\% of adolescents worldwide~\cite{li2024prevalence}. If untreated, it may lead to chronic pain, respiratory problems, and a diminished quality of life~\cite{weinstein2008adolescent}. Early and accurate screening is therefore essential. Traditional screening methods such as visual inspection, the Adam's forward bend test, and trunk rotation measurements are time-consuming, require specialized clinical expertise, and raise privacy concerns due to the need for exposed torso examinations. These limitations hinder large-scale screening, especially in resource-constrained settings~\cite{kadhim2020status}.

Artificial intelligence (AI) offers a promising solution to overcome these limitations. Early AI-based methods relied on static trunk imaging using photogrammetry~\cite{zhang2023deep,yang2019development} or 3D surface topography~\cite{adankon2012non,adankon2013scoliosis} to assess back asymmetry. While reducing the need for expert interpretation, these methods still depend on controlled environments and require exposed torsos. Recent approaches explore dynamic biomarkers in naturalistic settings. Among these, visual gait analysis captures walking-related movements and enhances both privacy and scalability in scoliosis screening. Zhou et al.~\cite{zhou2024gait} introduced this paradigm with the Scoliosis1K dataset, comprising human silhouettes extracted from natural walking videos. They also proposed ScoNet-MT, a CNN-based model designed to extract fine-grained body shape features for scoliosis screening. Despite improving data accessibility and reducing acquisition constraints, these methods often overlook clinically relevant asymmetries—such as shoulder imbalance and pelvic tilt~\cite{cma2020guideline}—that are critical for accurate and interpretable assessments.

Human pose data, with its explicit skeletal structure, offers a natural way to integrate these clinical evaluation criteria and has demonstrated value in various medical applications~\cite{lu2020vision,quan2024causality,islam2023using,wang2024enhancing}. Despite its potential, the use of pose data in scoliosis screening remains underexplored, mainly due to two key challenges: (1) the scarcity of large-scale, annotated pose datasets; and (2) the discrete and noise-sensitive nature of raw pose coordinates, which hinders the modeling of subtle body asymmetries.

To bridge this gap, we introduce \textbf{Scoliosis1K-Pose}, a 2D human pose annotation set that augments the original Scoliosis1K dataset~\cite{zhou2024gait} for scoliosis screening. It provides 2D keypoints extracted from natural walking videos of adolescents. Each frame includes 17 keypoints in the MS-COCO format~\cite{coco}, extracted using ViTPose~\cite{vitpose}. Building on this dataset, we propose the \textbf{Dual Representation Framework (DRF)}, a pose-based method for scoliosis screening. DRF integrates two complementary pose representations: a continuous \emph{skeleton map} that preserves the global spatial structure of the pose, and a discrete \emph{Postural Asymmetry Vector (PAV)} that explicitly encodes clinically relevant asymmetry descriptors (e.g., vertical, midline, and angular deviations) between paired keypoints. To synergize these representations, our novel \emph{PAV-Guided Attention (PGA)} module employs the PAV as a clinical prior to guide the network's feature extraction, directing attention to the most clinically relevant asymmetries within the skeleton map. This enables DRF to capture subtle yet clinically meaningful postural patterns for scoliosis assessment.

Our main contributions are threefold:

\textbf{- Scoliosis1K-Pose:} A new 2D human pose annotation set introduced in this work, augmenting the original Scoliosis1K dataset with detailed keypoint annotations for scoliosis screening.

\textbf{- Dual Representation Framework (DRF):} A novel, clinically informed framework that synergistically combines continuous skeleton maps with a discrete, clinically derived PAV. This is implemented through a PGA module that embeds clinical priors into the feature learning process.

\textbf{- State-of-the-Art Performance:} Extensive experiments demonstrate that DRF achieves superior performance in scoliosis screening. Visualizations further confirm that the model leverages clinical asymmetry cues to guide feature extraction and promote synergy between its dual representations.

\begin{figure*}[!t]
\centering
\includegraphics[width=0.7\textwidth]{figures/dataset_vis.pdf}
\caption{Examples from the Scoliosis1K dataset. Left: silhouettes representing (a) positive, (b) neutral, and (c) negative cases. Right: corresponding 2D skeletal keypoints estimated by ViTPose~\cite{vitpose}.}
\label{fig:dataset}
\end{figure*}

\section{Dataset}\label{sec:dataset}
Our research is built upon the Scoliosis1K dataset~\cite{zhou2024gait}, which originally provided only silhouette data from 1,493 walking videos of 1,050 adolescents. To enable fine-grained postural analysis, we introduce \textit{Scoliosis1K-Pose}, a large-scale 2D human pose annotation set that extends the original Scoliosis1K.

Scoliosis1K-Pose was constructed by processing 447,900 frames using ViTPose~\cite{vitpose} to extract 17 anatomical keypoints per frame, following the MS-COCO format~\cite{coco}. The extracted keypoints provide detailed skeletal representations of facial features (eyes, ears, nose), upper limbs (shoulders, elbows, wrists), and lower limbs (hips, knees, ankles). This augmentation transforms Scoliosis1K into a comprehensive multi-modal resource that integrates silhouette and 2D pose data, as illustrated in Figure~\ref{fig:dataset}. Participants are categorized as positive (scoliosis), neutral (borderline), or negative (non-scoliosis) according to established screening protocols. For detailed information on the original data collection and its characteristics, please refer to~\cite{zhou2024gait}.
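For readers working with the annotations, the indexing below follows the standard MS-COCO 17-keypoint convention; the eight anatomically symmetric pairs are the ones referenced later for asymmetry analysis. This is a reference sketch, not part of the released toolkit.

```python
# Standard MS-COCO 17-keypoint ordering (index -> joint name).
COCO17 = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Anatomically symmetric (left, right) index pairs: eyes, ears,
# shoulders, elbows, wrists, hips, knees, ankles.
SYMMETRIC_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8),
                   (9, 10), (11, 12), (13, 14), (15, 16)]
```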

\begin{figure*}[t]
\centering
\includegraphics[width=0.75\textwidth]{figures/drf.pdf}
\caption{Overview of the proposed Dual Representation Framework (DRF). (a) The overall pipeline, from raw pose data to the final scoliosis assessment. (b) The construction process of the Postural Asymmetry Vector (PAV). (c) The architecture of the PAV-Guided Attention (PGA) module, which generates channel and spatial attention weights.}
\label{fig:drf}
\end{figure*}

\section{Method}\label{sec:method}
We propose the \textbf{Dual Representation Framework (DRF)}, a novel approach to scoliosis screening that explicitly integrates clinical priors into a deep learning pipeline. The core idea is to emulate how clinicians assess scoliosis by identifying postural asymmetries. As shown in Figure~\ref{fig:drf}(a), the framework first transforms raw pose data into two complementary representations: a continuous \emph{skeleton map} and a discrete \emph{Postural Asymmetry Vector (PAV)}. The skeleton map is processed by a feature encoder. Concurrently, the PAV, serving as a clinical prior, guides a \emph{PAV-Guided Attention (PGA)} module to generate attention weights that modulate the extracted features. Finally, the refined features are used for scoliosis assessment.

The following sections describe the construction of the dual pose representations (Section~\ref{dual_rep}) and the PAV-guided feature learning and assessment (Section~\ref{cpi}).

\subsection{Dual Pose Representations}\label{dual_rep}
Given a sequence of 2D pose keypoints, we first normalize the data to ensure robustness to variations in subject position and scale. This involves aligning the pelvis by translating the midpoint of the hip joints to the origin, followed by height normalization to a standard height of 128 pixels. From the normalized pose, we construct two complementary representations.
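The normalization above can be sketched as follows; this is a minimal illustration rather than the released implementation, assuming COCO-17 indexing (11/12 = left/right hip) and body height measured as the vertical extent of the keypoints.

```python
import numpy as np

# Sketch of the pose normalization: translate the hip midpoint
# (pelvis) to the origin, then rescale so the body height equals
# 128 pixels. `pose` is a (17, 2) array of (x, y) coordinates.
def normalize_pose(pose, target_height=128.0):
    pose = np.asarray(pose, dtype=float)
    pelvis = (pose[11] + pose[12]) / 2.0           # hip midpoint
    centered = pose - pelvis                       # pelvis at origin
    height = centered[:, 1].max() - centered[:, 1].min()
    scale = target_height / height if height > 0 else 1.0
    return centered * scale
```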

\paragraph{\textbf{Skeleton Map: Continuous Gaussian Heatmaps.}}\label{skeleton_maps}
To convert sparse 2D keypoints into a dense, silhouette-like representation suitable for convolutional networks, we construct a two-channel skeleton map. This approach, inspired by related works~\cite{fan2024skeletongait,duan2022revisiting}, provides a richer input than raw coordinates by capturing both local joint details and global skeletal structure. It consists of two channels:

\textbf{- Keypoint Map:} Encodes each joint's location and confidence with a Gaussian heatmap:
\begin{equation}
J(i,j) = \sum_{k} \exp\left(-\frac{(i-x_k)^2+(j-y_k)^2}{2\sigma^2}\right) \cdot c_k,
\end{equation}
where $(x_k, y_k)$ and $c_k$ are the coordinates and confidence score of the $k$-th joint, respectively, and $\sigma$ controls the spread of each Gaussian.

\textbf{- Limb Map:} Represents skeletal connections as line-like heatmaps, preserving structural integrity:
\begin{equation}
L(i,j) = \sum_{n} \exp\left(-\frac{d_{L2}((i,j), S[n^-, n^+])^2}{2\sigma^2}\right) \cdot \min(c_{n^-}, c_{n^+}),
\end{equation}
where $S[n^-, n^+]$ denotes the limb segment between joints $n^-$ and $n^+$, and $d_{L2}(\cdot)$ is the Euclidean distance from pixel $(i,j)$ to this segment.

The resulting skeleton map yields a robust, continuous representation that mitigates pose estimation noise and serves as the primary input for feature extraction.
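The two channels can be rendered as in the sketch below; map size and sigma are arbitrary choices for illustration, and this is not the authors' exact implementation.

```python
import numpy as np

# Keypoint map: one Gaussian bump per joint, weighted by confidence.
# `kpts`: (K, 2) coordinates, `conf`: (K,) confidence scores.
def keypoint_map(kpts, conf, size=64, sigma=2.0):
    ys, xs = np.mgrid[0:size, 0:size]
    J = np.zeros((size, size))
    for (x, y), c in zip(kpts, conf):
        J += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)) * c
    return J

def point_to_segment(px, py, a, b):
    # Euclidean distance from each pixel (px, py) to segment a-b.
    ab = b - a
    t = ((px - a[0]) * ab[0] + (py - a[1]) * ab[1]) / max(ab @ ab, 1e-8)
    t = np.clip(t, 0.0, 1.0)
    return np.hypot(px - (a[0] + t * ab[0]), py - (a[1] + t * ab[1]))

# Limb map: line-like Gaussians along each skeletal connection,
# weighted by the smaller endpoint confidence.
def limb_map(kpts, conf, limbs, size=64, sigma=2.0):
    ys, xs = np.mgrid[0:size, 0:size]
    L = np.zeros((size, size))
    for n_minus, n_plus in limbs:
        d = point_to_segment(xs, ys, kpts[n_minus], kpts[n_plus])
        L += np.exp(-d ** 2 / (2 * sigma ** 2)) * min(conf[n_minus], conf[n_plus])
    return L
```

Stacking `keypoint_map` and `limb_map` gives the two-channel input described above.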

\paragraph{\textbf{PAV: Discrete Clinical Prior.}}\label{pav}
To explicitly incorporate clinical priors into our framework, we introduce the Postural Asymmetry Vector (PAV). Instead of relying on a few hand-picked landmarks, which are susceptible to noise and may miss subtle compensatory patterns, the PAV is designed to capture a holistic asymmetry profile. The PAV is a matrix $v \in \mathbb{R}^{P \times M}$, where $P$ denotes the number of anatomically symmetric keypoint pairs (e.g., eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles in COCO17 format), with the $p$-th pair represented as $(k_p^L, k_p^R)$, and $M$ denotes the number of asymmetry metrics.

The PAV is constructed in two stages, as illustrated in Fig.~\ref{fig:drf}(b):

\textbf{1. Asymmetry Metric Computation.} For each frame, a raw matrix $\tilde{v} \in \mathbb{R}^{P \times M}$ is computed using the following three metrics for each keypoint pair:

- Vertical Deviation ($L_{VD}$): $|y_p^L - y_p^R|$ quantifies height differences, directly corresponding to clinical signs such as shoulder and pelvic imbalance.

- Midline Deviation ($L_{MD}$): $\left|\frac{x_p^L + x_p^R}{2} - x_{\text{midline}}\right|$ measures lateral drift from the body's central axis ($x_{\text{midline}}$ is defined by the hip center), reflecting spinal curvature.

- Angular Deviation ($\alpha_{AD}$): $\left|\arctan\left(\frac{y_p^L - y_p^R}{x_p^L - x_p^R}\right)\right|$ captures the tilt of the segment connecting a keypoint pair, indicative of rotational asymmetries in the trunk and limbs.
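The three per-pair metrics follow directly from the paired coordinates; a sketch for a single frame and keypoint pair, where `x_mid` is the hip-center x-coordinate:

```python
import numpy as np

# (xl, yl) / (xr, yr): left/right keypoint of one symmetric pair.
def asymmetry_metrics(xl, yl, xr, yr, x_mid):
    vd = abs(yl - yr)                      # L_VD: vertical deviation
    md = abs((xl + xr) / 2.0 - x_mid)      # L_MD: midline deviation
    dx = xl - xr
    # alpha_AD: angular deviation (vertical segment -> pi/2).
    ad = abs(np.arctan((yl - yr) / dx)) if dx != 0 else np.pi / 2
    return vd, md, ad
```

Evaluating all P pairs with all M = 3 metrics fills one row block of the raw matrix per frame.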

\textbf{2. Sequence-Level PAV Refinement.} The raw matrix $\tilde{v}$ is refined across the sequence to obtain a stable PAV $v$ using the following steps:

(1) Outlier Removal: Statistical filtering is applied independently to each element in $\tilde{v}$ using the interquartile range (IQR) method.

(2) Temporal Aggregation: For each metric, the mean of the valid measurements across all frames is computed, summarizing temporal variation into a stable sequence-level descriptor.

(3) Normalization: Min-max scaling is applied across the dataset for all $P \times M$ dimensions, standardizing values to the $[0,1]$ range and ensuring consistent feature scaling.

This two-stage design enables robust quantification of global postural asymmetry and allows the model to capture complex, clinically relevant inter-joint relationships. The resulting PAV acts as a clinical prior to guide the subsequent feature learning process.
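The refinement steps can be sketched per metric as follows; the 1.5-IQR fence is an assumption, as the exact multiplier is not specified above, and min-max scaling is shown as a separate dataset-level pass.

```python
import numpy as np

# Steps (1)-(2): IQR outlier removal over a metric's per-frame
# values, then the temporal mean of the surviving measurements.
def refine_metric(values):
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    valid = v[(v >= q1 - 1.5 * iqr) & (v <= q3 + 1.5 * iqr)]
    return valid.mean()

# Step (3): min-max scaling of one PAV dimension across the dataset.
def minmax_scale(column):
    c = np.asarray(column, dtype=float)
    lo, hi = c.min(), c.max()
    return (c - lo) / (hi - lo) if hi > lo else np.zeros_like(c)
```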

\subsection{PAV-Guided Feature Learning and Assessment}\label{cpi}
Our DRF uses the clinical prior encoded in the PAV to guide visual feature learning from the skeleton map. This is accomplished using a feature encoder followed by our novel PAV-Guided Attention (PGA) module.

\paragraph{\textbf{Feature Encoder.}}\label{feature_encoder}
First, the feature encoder processes the input skeleton map sequence to extract high-level features. For a fair comparison, we adopt the encoder architecture from ScoNet-MT~\cite{zhou2024gait}. It uses a ResNet-based backbone followed by temporal and horizontal pooling to generate a feature map $F_{\text{enc}} \in \mathbb{R}^{C \times H}$, where $C=256$ is the number of channels and $H$ is the number of horizontal body segments. This feature map $F_{\text{enc}}$ captures general postural patterns from the data.

\paragraph{\textbf{PAV-Guided Attention (PGA) Module.}}\label{pav_attention}
To ensure these features are clinically relevant, we introduce the PGA module. Unlike standard attention, which relies solely on data-driven correlations, our module explicitly incorporates a clinical prior to generate attention weights. In our framework, the prior is the PAV, which guides the model to focus on features critical for scoliosis screening.

As shown in Fig.~\ref{fig:drf}(c), the module receives two inputs: the encoded features $F_{\text{enc}}$ and the PAV $v$. It then generates both channel-wise ($w_c$) and spatial-wise ($w_s$) attention weights from the PAV:
\[
w_c = \sigma(W_c \cdot v), \quad w_s = \sigma(W_s * v),
\]
where $\sigma$ denotes the sigmoid function, $W_c$ is a learnable linear layer, and $W_s$ is a learnable 1D convolution. The resulting attention weights, $w_c \in \mathbb{R}^{C \times 1}$ and $w_s \in \mathbb{R}^{1 \times H}$, are used to recalibrate the encoded features via element-wise multiplication:
\[
F_{\text{out}} = F_{\text{enc}} \odot w_c \odot w_s,
\]
where $\odot$ denotes element-wise multiplication with broadcasting. This produces a refined feature map $F_{\text{out}}$ that integrates visual patterns with clinical priors, yielding a more informative representation for downstream assessment.
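A minimal numpy sketch of this recalibration; the shapes and the dense spatial projection here are illustrative assumptions (the module above uses a learnable 1D convolution for the spatial branch), intended only to show the broadcasting pattern.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
C, H, PM = 256, 16, 24              # channels, segments, flattened PAV dim
F_enc = rng.standard_normal((C, H))  # encoder output (C, H)
v = rng.random(PM)                   # flattened PAV in [0, 1]

W_c = rng.standard_normal((C, PM)) * 0.1  # stands in for the linear layer
W_s = rng.standard_normal((H, PM)) * 0.1  # dense stand-in for the 1D conv

w_c = sigmoid(W_c @ v).reshape(C, 1)  # channel attention, shape (C, 1)
w_s = sigmoid(W_s @ v).reshape(1, H)  # spatial attention, shape (1, H)

F_out = F_enc * w_c * w_s             # broadcasted recalibration
```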

\paragraph{\textbf{Scoliosis Assessment and Training Objective.}}
The refined features $F_{\text{out}}$ are passed to a classification head with fully connected layers to produce the final logits for scoliosis assessment. The entire framework is trained end-to-end with a combined loss function, following the approach of ScoNet-MT~\cite{zhou2024gait}:
\[
L_{\text{total}} = L_{\text{ce}} + L_{\text{triplet}}.
\]

\begin{table}[t] % float placement option
\centering
\begin{minipage}[!h]{0.58\textwidth}
\caption{Comparison with state-of-the-art methods on the Scoliosis1K dataset. The best results are in bold. ScoNet-MT$^{ske}$ refers to ScoNet-MT~\cite{zhou2024gait} adapted for skeleton map input.}
\label{tab:main_results}
\centering
\begin{threeparttable}
\resizebox{0.94\textwidth}{!}{
\begin{tabular}{c|c|c|ccc}
\toprule
 & & & \multicolumn{3}{c}{Macro-average} \\ \cline{4-6}
\multirow{-2}{*}{Input} &
\multirow{-2}{*}{Method} &
\multirow{-2}{*}{\begin{tabular}[c]{@{}c@{}}Total\\ Accuracy\end{tabular}} &
Prec & Rec & F1 \\ \hline
Silhouette & ScoNet-MT~\cite{zhou2024gait} & 82.0 & 84.0 & 75.9 & 75.4 \\ \hline
 & OF-DDNet~\cite{lu2020vision} & 74.1 & 68.6 & 68.4 & 67.0 \\
\multirow{-2}{*}{\begin{tabular}[c]{@{}c@{}}Pose\\ Coordinates\end{tabular}} &
GPGait~\cite{fu2023gpgait} & 74.9 & 69.2 & 68.9 & 67.5 \\ \hline
 & ScoNet-MT$^{ske}$ & 82.5 & 81.4 & 74.3 & 76.6 \\
\multirow{-2}{*}{Skeleton Map} &
\cellcolor[HTML]{E5E5E5} DRF (ours) &
\cellcolor[HTML]{E5E5E5}\textbf{86.0} &
\cellcolor[HTML]{E5E5E5}\textbf{84.1} &
\cellcolor[HTML]{E5E5E5}\textbf{79.2} &
\cellcolor[HTML]{E5E5E5}\textbf{80.8} \\ \hline
\end{tabular}}
\end{threeparttable}
\end{minipage}%
\begin{minipage}[!h]{0.40\textwidth}
\centering
\includegraphics[width=0.8\textwidth]{figures/heatmap.pdf}
\captionsetup{type=figure}
\caption{Visualization of feature response heatmaps. (a) ScoNet-MT. (b) ScoNet-MT$^{ske}$. (c) Our DRF.}
\label{fig:heatmap}
\end{minipage}
\end{table}

\section{Experiments}

\subsection{Setup}
\paragraph{\textbf{Evaluation Protocol.}} We use the standard dataset split from~\cite{zhou2024gait} (745 training, 748 test sequences) with a real-world class distribution (positive:neutral:negative = 1:1:8). To address class imbalance and enable fair comparison, we report overall accuracy along with macro-averaged precision, recall, and F1-score.
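Macro-averaging computes each metric per class and takes the unweighted mean, so the rare positive and neutral classes weigh as much as the dominant negative class; a small sketch of the computation:

```python
# Sketch of macro-averaged precision/recall/F1 for the three-class
# screening task. Each metric is computed per class, then averaged
# with equal weight regardless of class frequency.
def macro_prf(y_true, y_pred, classes=("pos", "neu", "neg")):
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n
```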

\paragraph{\textbf{Implementation Details.}} Our method is implemented using PyTorch~\cite{paszke2019pytorch} and OpenGait~\cite{opengait}. Following ScoNet-MT~\cite{zhou2024gait}, we retain the same network architecture and training protocol, with two key modifications: (1) the Conv0 layer is adapted to process skeleton maps instead of silhouettes, and (2) the proposed PGA module is integrated to guide feature extraction.

\subsection{Comparison with State-of-the-Art Methods}
We perform comprehensive comparisons with state-of-the-art methods using various input representations. For silhouette-based methods, we compare our model with ScoNet-MT~\cite{zhou2024gait}. For pose coordinate-based approaches, we include OF-DDNet~\cite{lu2020vision} and GPGait~\cite{fu2023gpgait}. OF-DDNet is adapted to process 2D pose inputs to ensure a fair comparison.

\paragraph{\textbf{Quantitative Results.}} As shown in Table~\ref{tab:main_results}, our proposed DRF achieves state-of-the-art performance across all evaluation metrics. Pose coordinate-based methods, such as GPGait and OF-DDNet, yield the lowest performance, with F1-scores of 67.5\% and 67.0\%, respectively. Image-based representations perform better overall: the silhouette-based ScoNet-MT model significantly outperforms the pose coordinate-based methods, and when adapted to use our skeleton map input (ScoNet-MT$^{ske}$), it achieves a further 0.5\% improvement in accuracy and a 1.2\% gain in F1-score. These results confirm that the skeleton map captures rich structural information. Importantly, integrating the PGA module with the PAV enables our final DRF model to significantly outperform the ScoNet-MT$^{ske}$ baseline: accuracy increases by 3.5\%, and the F1-score improves by 4.2\%. This improvement demonstrates that explicitly guiding the model with clinical priors on postural asymmetry is highly effective for scoliosis screening.

\paragraph{\textbf{Qualitative Results.}} As shown in Figure~\ref{fig:heatmap}, we visualize the intermediate feature responses using the visualization technique from~\cite{zhou2016learning}. ScoNet-MT with silhouette input (Fig.~\ref{fig:heatmap}(a)) focuses on isolated regions (e.g., head and shoulders), missing the holistic structural relationships crucial for scoliosis screening. With the skeleton map as input, ScoNet-MT$^{ske}$ (Fig.~\ref{fig:heatmap}(b)) preserves structural information but mainly attends to one side of the body. By incorporating our PGA module guided by the PAV, our DRF (Fig.~\ref{fig:heatmap}(c)) more effectively attends to key regions on both sides of the body (e.g., imbalanced shoulders and pelvis), which corresponds strongly to our clinical prior integration objective. These visualizations confirm that the PAV-guided PGA module effectively steers feature extraction toward clinical asymmetry cues, creating a synergistic relationship between our dual representations.

\begin{table*}[t]
\caption{Ablation studies on the Scoliosis1K dataset. The best results are in bold.}
\centering
\begin{minipage}{0.52\linewidth}
\centering
\resizebox{0.7\linewidth}{!}{
\begin{tabular}{cc|cccc}
\hline
 & & & \multicolumn{3}{c}{Macro-average} \\ \cline{4-6}
\multirow{-2}{*}{Channel} & \multirow{-2}{*}{Spatial} & \multirow{-2}{*}{\begin{tabular}[c]{@{}c@{}}Total\\ Acc\end{tabular}} & Prec & Rec & F1 \\ \hline
 & & 82.5 & 81.4 & 74.3 & 76.6 \\
$\checkmark$ & & 79.7 & 80.9 & 75.7 & 74.9 \\
 & $\checkmark$ & 81.7 & 77.1 & \textbf{80.8} & 78.1 \\
$\checkmark$ & $\checkmark$ & \textbf{86.0} & \textbf{84.1} & 79.2 & \textbf{80.8} \\ \hline
\end{tabular}
}
\subcaption{Component analysis of the PGA module. All variants of the PGA module are guided by our PAV.}
\label{tab:pga_ablation}
\end{minipage}
\hfill
\begin{minipage}{0.44\linewidth}
\centering
\resizebox{0.8\linewidth}{!}{
\begin{tabular}{c|cccc}
\hline
\multirow{2}{*}{Guidance Source} & \multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Total\\ Acc\end{tabular}} & \multicolumn{3}{c}{Macro-average} \\ \cline{3-5}
 & & Prec & Rec & F1 \\ \hline
Self-attention & 81.4 & \textbf{84.5} & 73.9 & 75.3 \\ \hline
All-Ones & 82.2 & 80.8 & 74.3 & 74.7 \\
Random & 75.4 & 82.2 & 62.7 & 57.9 \\
Learnable & 69.4 & 84.2 & 53.2 & 56.4 \\
PAV (ours) & \textbf{86.0} & 84.1 & \textbf{79.2} & \textbf{80.8} \\ \hline
\end{tabular}
}
\subcaption{Comparison of PAV and alternative guidance sources for the PGA module.}
\label{tab:pav_ablation}
\end{minipage}
\end{table*}
\subsection{Ablation Studies}\label{sec:ablation}
To validate the design of our framework, we conduct two sets of ablation studies. First, we analyze the effectiveness of the PGA module and its components. Second, we verify that the proposed PAV provides superior guidance compared to alternative strategies.

\paragraph{\textbf{Effectiveness of the PGA Module.}} We first evaluate the overall contribution of the PGA module, followed by an analysis of its channel-wise and spatial-wise attention branches. We use ScoNet-MT$^{ske}$ as our baseline, which operates without any attention guidance. As shown in Table~\ref{tab:pga_ablation}, integrating the full PGA module (guided by PAV) into the baseline significantly improves performance, with a 3.5\% increase in accuracy and a 4.2\% gain in F1-score. Furthermore, both attention branches are essential: disabling either the channel or spatial branch results in a notable performance decline compared to the full model. This highlights the synergistic interaction between channel and spatial attention, which is essential for effectively leveraging the clinical cues provided by the PAV.

\paragraph{\textbf{Effectiveness of PAV as a Clinical Prior.}}
After confirming the effectiveness of the PGA module's architecture, we next examine the importance of the guidance source. We compare the PAV with several alternative guidance vectors within the same PGA module architecture. The evaluated alternatives include:

(1) \textit{No External Prior (Self-attention)}: Guidance is generated directly from the feature maps, following a standard self-attention mechanism. This setup evaluates whether the performance gain arises solely from the attention mechanism rather than from the clinical prior.

(2) \textit{Fixed, Uninformative Priors}: Fixed vectors with the same dimensionality as the PAV are used: (a) All-Ones, representing uniform attention, and (b) Random, representing arbitrary, unstructured guidance. These variants test whether a static structural bias could achieve similar benefits.

(3) \textit{Learnable Prior}: The PAV is replaced by a learnable parameter vector, enabling the network to derive a data-driven guidance signal without explicit clinical constraints.

As shown in Table~\ref{tab:pav_ablation}, using the PAV to guide the PGA module significantly outperforms all alternatives. The self-attention approach results in a 5.5\% drop in F1-score, indicating that the explicit prior offers greater value than internally derived correlations. The uninformative priors (All-Ones, Random) and the learnable prior also perform significantly worse, with F1-score reductions ranging from 6.1\% to 24.4\%. These results confirm that the performance improvement arises not merely from using an attention module, but crucially from the rich, clinically relevant information encoded in the PAV. This underscores the value of incorporating an explicit clinical prior for robust scoliosis screening.

\section{Conclusion}
This study proposes a novel, clinically informed approach for pose-based scoliosis screening. The proposed Dual Representation Framework (DRF), combined with the Scoliosis1K-Pose dataset, integrates continuous skeletal structures with clinically informed asymmetry descriptors to improve model accuracy and interpretability, enabling accessible and privacy-preserving scoliosis screening. We hope this work will inspire further research on scoliosis screening using large-scale video data.

% \begin{comment}
% The following acknowledgement and disclaimer sections should be removed for the double-blind review process.
% If and when your paper is accepted, reinsert the acknowledgement and the disclaimer clause in your final camera-ready version.

\begin{credits}
\subsubsection{\ackname} This work was supported by the National Natural Science Foundation of China (Grant 62476120) and the Scientific Foundation for Youth Scholars of Shenzhen University (Grant 868-000001033383).
\subsubsection{\discintname}
The authors have no competing interests to declare that are relevant to the content of this article.
\end{credits}

%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{refs}
\end{document}