## Abstract

Analyzing anatomic shapes of tissues and organs is pivotal for accurate disease diagnostics and clinical decision-making. One prominent disease that depends on anatomic shape analysis is osteoarthritis, which affects 30 million Americans. To advance osteoarthritis diagnostics and prognostics, we introduce *ShapeMed-Knee*, a 3D shape dataset with 9,376 high-resolution, medical-imaging-based 3D shapes of both femur bone and cartilage. Besides data, ShapeMed-Knee includes two benchmarks for assessing reconstruction accuracy and five clinical prediction tasks that assess the utility of learned shape representations. Leveraging ShapeMed-Knee, we develop and evaluate a novel hybrid explicit-implicit neural shape model which achieves up to 40% better reconstruction accuracy than a statistical shape model and implicit neural shape model. Our hybrid models achieve state-of-the-art performance for preserving cartilage biomarkers; they’re also the first models to successfully predict localized structural features of osteoarthritis, outperforming shape models and convolutional neural networks applied to raw magnetic resonance images and segmentations. The ShapeMed-Knee dataset provides medical evaluations to reconstruct multiple anatomic surfaces and embed meaningful disease-specific information. ShapeMed-Knee reduces barriers to applying 3D modeling in medicine, and our benchmarks highlight that advancements in 3D modeling can enhance the diagnosis and risk stratification for complex diseases. The dataset, code, and benchmarks will be made freely accessible.

## I. Introduction

Osteoarthritis (OA) is the leading cause of pain and disability in developed countries, impacting 30.8 million US adults with an annual US cost of $180 billion [2]. OA affects all tissues in a joint, with emphasis on bone and cartilage. The majority of deep learning research in OA focuses on 2D convolutional neural networks (CNNs) applied to X-rays, 2D and 3D CNNs for segmentation of magnetic resonance images (MRI), and few studies using 3D CNNs for classification of MRIs [3], [4], [5], [6]. OA research largely focuses on X-rays due to the limitations of efficiently processing large 3D image volumes, however, X-rays are a 2D projection of the joint and are thus prone to errors, particularly with repositioning [7].

Characterizing OA relies on medical imaging to discern the shape of anatomic tissues [8]. As OA progresses, osteophytes grow at the edges of cartilage, and cartilage is thinned. Diagnosis of OA is based on these shape features [8]. Beyond OA, shape analysis also serves as the basis for numerous health conditions and diagnoses. For example, shape modeling is crucial for diagnosis and treatment of craniosynostosis, a pediatric condition where skull bones fuse early, causing deformity and potential brain damage [9]. Numerous orthopedic conditions are related to bone shape; both gross shape [10], [11] and nuanced curvatures of joint articulations [12] are important for diagnosing, treating, and preventing disease.

Shape modeling provides an efficient way to analyze 3D anatomic data [13]. However, current shape models, and shape model research has limitations. Widely adopted statistical shape models (SSMs) require anatomic point matching, which is not guaranteed and, in disease, may not be possible. For example, osteophytes that form in OA are not present in healthy bones, and thus no true matching points exist. Once matching points are obtained, SSMs are typically fit using linear statistical representations, namely principal components analysis (PCA); shape features of disease are unlikely to be purely linear in nature. Applications of SSMs in medicine are typically used to identify gross features or predict disease in general [14], [15]; accurate quantification of specific, localized, biomarkers of disease are required for clinical applications. To advance shape analysis in medicine, we require benchmarks that assess clinically relevant reconstruction metrics, and whether a model can localize relevant disease features.

With our overarching objective to enable the advancement of medical domain-specific 3D modeling, we provide the following contributions (Fig. 1):

We introduce

: a 3D anatomic dataset with 9,376 shapes, each including two interrelated objects (femur bone and cartilage). We publicly share segmentation masks, and 3D shapes.*ShapeMed-Knee*We define seven medically relevant benchmark tasks with our ShapeMed-Knee dataset: surface reconstruction, cartilage biomarker calculation from reconstructions, disease diagnosis, localized disease staging, and future surgical event prediction.

We develop hybrid explicit-implicit neural shape models (NSM) that outperform both SSMs and implicit NSMs for bone and cartilage reconstruction (7-20% lower average symmetric surface distance).

We demonstrate that our hybrid NSM outperforms an SSM, implicit NSM, and CNN in disease staging, disease diagnosis, and localization of specific features of disease.

We show that interpolation in NSM latent space produces interpretable smooth interpolation of physical shape, clinical shape features, and clinical predictions.

We demonstrate precise control over localized disease features by interpolating latent space along classifier-fitted vectors, enabling targeted manipulations of disease characteristics.

We publicly share our NSM model and the code used for training and inference. A tutorial on how to download and used the data is provided at https://github.com/gattia/shapemedknee.

## II. Related Work

Neural representations have advanced computer graphics [16]. ShapeNet data has been central to the advancement of generative 3D shape models [17]. The recently proposed Med-ShapeNet is similar to ShapeNet, but includes 3D anatomic shapes with multiple inter-related tissues [18]. However, there still exists a gap in 3D anatomic models with curated disease-specific reconstruction metrics and clinical tasks; these data are needed to enable focused research that advances methods for quantifying anatomic shapes and understanding how these shapes influence health and disease.

### A. Generative Implicit Neural Representations

DeepSDF was the first reported use of a generative implicit neural representation [19]. DeepSDF uses a multilayer perceptron (MLP) to generate shapes conditioned on a latent vector *z*. DeepSDF enables shape compression, interpolation, and completion from partial observations. Numerous DeepSDF advances have been proposed. Curriculum DeepSDF using curriculum learning [20]. Modulated Periodic Activations combine two MLPs as a means of leveraging periodic (sinusoidal) activations, which outperformed rectified linear unit (ReLU) MLPs for single object reconstruction [21], [22].

To improve reconstruction of large scenes or fine details, instead of a single global *z*, a spatially localized *z* is input into the MLP [23], [24]. Hybrid explicit-implicit formulations generate localized *z* by leveraging the expressivity of CNNs [24], [25], [26], [27]. Both generative adversarial network and variational autoencoder (VAE) frameworks have been used in these hybrid explicit-implicit models [25], [26].

### B. Shape Modeling

Shape modeling has many important applications for biomedical data. In just the OA community, shape models have been used for automated segmentation [28], [29], disease prediction and staging [15], [30], [31], and generating synthetic data for physics-based simulations [12], [32]. Shape models have advanced understanding and treatment of conditions related to the heart, brain, skull, and bones, to name a few [33], [9], [34], [10], [11]. Improved shape modeling can benefit all of these areas, providing tangible benefits in understanding disease and improving patient health.

### C. Statistical Shape Models

Conventional SSMs use PCA to learn shape features. The main challenge with PCA-based SSMs for anatomical objects is the need for matching points at the same anatomical location on each object. Correspondence is typically obtained via non-rigid image registration of signed distance fields [28], or non-rigid point cloud registration [14], [12], [35]. To improve anatomic correspondence, registration features beyond XYZ coordinates, such as spectral coordinates or curvatures have been included [14], [36]. Registration is prone to failure in abnormal or diseased areas, which are typically the most important.

### D. Neural Shape Models

We refer to generative shape models in the medical domain as NSMs. There are only a handful of NSM applications. Amiranashvili et al. fit an occupancy NSM to anisotropic bone data showing occupancy-based methods can be trained and applied to undersampled anisotropic data. However, the occupancy NSMs still exhibit relatively large reconstruction errors (average symmetric surface distance (ASSD): 0.25-0.48mm) [37]. Jensen et al. fit a NSM by deforming points on a sphere using point-specific latent vectors. During training, a single latent vector was used for all points, while during inference, latents vary over the surface to increase expressivity. They showed better reconstruction than DeepSDF and improved segmentation results [38]. Ludke et al. used a neural flow deformer to fit a NSM by deforming coordinates from a template shape to the target, outperforming a conventional SSM in terms of surface reconstruction and simple OA classification [39].

Biomedical research demonstrates that implicit neural representations applied as NSMs improve anatomical reconstructions and image segmentation results and can encode basic clinical information. However, existing work represents only a single tissue at a time, uses relatively small samples of data (41-354 examples), and primarily focuses on surface reconstruction results rather than the quality of learned representations. Finally, biomedical approaches are challenging to compare as they use different datasets and downstream prediction tasks.

## III. Dataset & Evaluation

Data from this study is derived from the Osteoarthritis Initiative (OAI), a multi-center, longitudinal observational study of 4,796 men and women (45-79 years of age) with the goal of developing biomarkers of OA. The OAI collected patient clinical data, X-rays, and MRIs annually for 9 years. Important for the prediction tasks in this study, teams of expert radiologists were contracted to label acquired images for OA diagnosis, as well as standardized features of OA disease. We derive our dataset from the MR imaging data collected at the baseline time point and the radiologist evaluations from the baseline and all follow-up time points.

We used stratified random sampling to split the OAI baseline data into train/validation/test sets at the subject level, as right and left joints can be highly correlated and provide a form of data leakage. Splits were stratified over sex and clinical prediction tasks (III) to ensure disease states and outcomes were equally represented. Due to the iterative and time-consuming nature of fitting the NSM during inference, a small validation set was used in this study (train: 67.5%, 3,233 people and 6,325 knees; validation: 2.5%, 74 people and 141 knees; test: 30.0%, 1,481 people and 2,910 knees). Tab. I contains an overview of the included data.

### A. ShapeMed-Knee Dataset Creation

#### 1) Segmentations & Surfaces

We extracted 9,376 Double Echo in Steady State (DESS) knee MRIs from the baseline visit of participants in the OAI [40]. We segmented DESS MRIs automatically using a multi-stage CNN framework; this approach was validated on the OAI dataset, achieving Dice similarity coefficients of 0.99 and 0.91 for femoral bone and cartilage and low ASSD (0.08-0.15mm) [41]. This performance is equivalent to the best-reported cartilage segmentations [6], [29], and is the same as expert-human level in terms of cartilage sensitivity to change [42]. All left knee MRI segmentations were flipped to create right knees and remove variance due to anatomical side. Three-dimensional surfaces were then generated from each femur bone and cartilage segmentation mask using previously established methods [35]; code to create surface meshes is shared for reproducibility.

##### a) Cartilage Thickness Biomarker

Mean cartilage thickness in pre-defined anatomic regions is a common biomarker for clinical trials and experimental studies [43], [44]. It is critical that NSM-reconstructed surfaces preserve these biomarkers relative to reference surfaces [45]. We calculated cartilage biomarkers with the following processing steps: i) divide cartilage segmentations into subregions, ii) compute cartilage thickness for each vertex over the bone surface, iii) assign each bone-vertex to one of the subregions. Cartilage biomarker calculations used open-source code [46] used in previous investigations [35], [47]. From these data, we computed five cartilage thickness biomarkers as the mean thickness for all bone mesh vertices in each of five established cartilage subregions (trochlea, medial central, lateral central, medial posterior, lateral posterior) [48]. Visualization of cartilage thickness, subregions, and a general orientation to the data are presented in Fig. 2.

##### b) Bone Surface Registration

All femur bones were co-registered to have matching points to create a traditional SSM (Sec. IV) as a baseline model; original full resolution meshes (∼ 220,000 points) were used for the NSMs. First, to reduce the computational complexity of the registration, each bone mesh was downsampled to 20,000 vertices [49], [50]. Next, an average femur shape, determined from 281 knees in a prior study [51], was used as the template and non-rigidly registered to every other bone in the dataset using spectral correspondence-based registration [52], [36] that has been used in multiple knee OA studies [35], [14]. Cartilage thickness and subregions were re-calculated for the registered meshes as described in the previous section Sec. III-C.2. The resulting registered meshes included matching points and cartilage thicknesses for 9,376 femur bones.

##### c) Mesh Quality Control

To ensure high-quality meshes in the dataset, we generated static images of every bone mesh from 4 orthogonal planes (top, bottom, front, back) using pyVista [53] and an imaging researcher with 10 years of experience with bone analysis manually reviewed every image. From this analysis, we identified 57 meshes (0.6%) with large errors primarily due to physiologically-plausible holes at the sites of anterior cruciate ligament reconstruction. These 57 meshes were removed from the dataset. An additional 9 meshes had moderate errors, and 30 meshes had small potential errors; these meshes were retained in the dataset. IDs for moderate and small error meshes, and quality control images for all knees are provided for dataset users to use their custom exclusion criteria.

### B. Prediction Tasks

OA is a whole joint disease that affects multiple tissues, with an emphasis on the cartilage and bone. We developed five prediction tasks which test a model’s ability to understand shape complexity relevant to current bone and cartilage health as well as future disease progression.

OA is commonly diagnosed using X-rays graded using the Kellgren-Lawrence (KL) system [8]. The KL system assigns knees a grade between 0-4 (0 = no OA, 1 = doubtful OA, 2 = mild OA, 3 = moderate OA, 4 = severe OA). Diagnosis with OA is defined as *KL* ≥ 2. Beyond diagnosis, KL grading is used in research and clinical trials to “stage” the severity of OA in the whole joint (all tissues/bones) beyond binary classification. Therefore, our first two tasks are:

**General OA staging**by predicting KL grade (0-4)**Binary OA diagnosis**(*KL*≥ 2)While KL grading provides a whole-joint OA measure, it is a coarse measurement based on 2D X-rays and does not provide fine-grained, location-specific information in 3D. Therefore, it cannot be used to identify where and what tissues are involved in a person’s disease. The MRI Osteoarthritis Knee Score (MOAKS) measures multiple features of OA that are localized to different regions of the joint [54]. Our third task involves predicting three MOAKS scores (one bone and two cartilage features) in six distinct regions of the femur. MOAKS scoring provides clinically important information and can simultaneously serve as a test of how well a model can spatially localize fine-grained OA features. Task three is:

Advanced localized OA staging by predicting three MOAKS scores (Score 1: Osteophytes, Score 2: Cartilage Thinning, Score 3: Cartilage Hole) in 6 femoral regions divided across the anterior, central, and posterior regions in the medial and lateral condyles.

The three MOAKS scores were defined as follows:

: Osteophytes are abnormal bone growths (bone spurs) that occur at the edges of the cartilage and are a hallmark sign of OA. The MOAKS osteophyte score includes 4 levels (0: None, 1: small, 2: medium, 3: large). Due to a low prevalence of grade 3 scores (< 5%), we binned MOAKS osteophyte score into 3 levels (0-2) where level 2 includes original scores of 2/3.*Score 1 Osteophytes*: A key sign of OA is cartilage thinning. The MOAKS cartilage thinning score categorizes the % of a region with any cartilage thinning into 4 categories. Given a class imbalance amongst the four categories, we binarize this score as individuals with < 10% thinning (grades 0/1) and > 10% thinning (grades 2/3). This approach is used in prior OA studies [55].*Score 2 Cartilage Thinning*: The final score quantifies the % of a region that has a full thickness defect (a hole) in the cartilage into the same 4 levels (0-4) as cartilage thinning. Cartilage holes rarely occur (6-16%), thus we binarized this score into no hole (grade 0) and any hole (≥ 1).*Score 3 Cartilage Hole*

The final two tasks were created to test whether a model can predict future OA diagnosis (within 4 years) in currently healthy subjects, and whether a medical event (knee replacement) has occurred (within 9 years). Future OA diagnosis and knee replacement prediction are common tasks performed in the OA literature, are challenging, and would provide valuable information to identify which patients should be treated earlier. MRI-based SSMs of bone shape, and CNN’s applied to X-ray data have previously been used to predict these outcomes [15], [30], [56], [57].

4)

**Predict future disease (OA)**within 4 years.5)

**Predict future knee replacement**within 9 years.

### C) Evaluations

#### 1) Surface Reconstruction

We evaluate surface reconstruction errors separately for the bone and cartilage surfaces using ASSD. We test ASSD on the whole test set and separately for the 5 KL grades to assess whether reconstruction errors depend on disease state.

#### 2) Cartilage Thickness Biomarker

To evaluate whether reconstructed bone and cartilage surfaces preserve important cartilage biomarkers, we analyze the five cartilage subregions on the whole test set and on each of the 5 KL grades in the test set. Between the mean thickness of the original and reconstructed surfaces we compute 1) the root mean squared error (RMSE↓) to determine absolute errors and 2) the standard deviation of the difference (*SDD* ↓) as a measure of consistency that removes the effect of systematic bias.

#### 3) Prediction Tasks

. OA staging is quantified using the KL grade, a semi-quantitative multi-class measure of OA with variation between raters. As such, relative agreement is commonly used to assess KL predictions and inter-rater agreement. We use accuracy and quadratically-weighted Cohens Kappa, as done previously [58], [59], [60].*OA Staging*. As OA diagnosis is a binary prediction task with relatively well-balanced groups, we compute the common metrics of area under the receiver operating characteristic curve (AUROC) and accuracy.*OA Diagnosis*. We assess three MOAKS scores (measuring osteophytes, cartilage thinning, cartilage holes) separately for six regions of interest. Score 1 (osteophytes) includes three classes, and thus we compute quadratically weighted Kappa and accuracy. Since both Score 2 (cartilage thinning) and Score 3 (cartilage hole) are binary tasks with large class imbalance, we compute F1 score and the area under the precision-recall curve (AUPRC).*Advanced OA Staging (MOAKS)*. The incidence of OA in the four years following baseline was relatively rare, occurring in only 9% of subjects. Therefore, we compute the F1 score and AUPRC.*Future disease (OA)*. The incidence of knee replacement in the 9 year follow-up was rare (5%). Therefore, we compute the F1 score and AUPRC.*Future knee replacement surgery*

## IV. Benchmark Models

We compared multiple types of shape models and CNNs on our tasks. We compare an SSM, implicit NSM, and our hybrid explicit-implicit NSM for reconstruction tasks. In addition to these models, for the prediction tasks, we also compare 3D CNNs applied to raw image data and to bone/cartilage segmentations. The models are described in the following.

### A. Neural Shape Models

DeepSDF-based NSMs train a decoder to take as input a latent vector *z* and coordinate *x* and predict the signed distance *s* of *x*. NSMs typically use an autodecoder framework where *z* is learned by jointly optimizing a dictionary of latents along with the network weights to predict *s* while using regularization so *z* matches a multivariate Gaussian distribution. Both NSMs used in this study were trained using the same framework, including point sampling, training hyperparameters, and reconstruction strategy.

#### a) Point Sampling

Before training, an arbitrary mesh was chosen as the reference. Every other bone mesh was registered to the reference using a similarity transform (rigid + scale); the transform was applied to the coinciding cartilage surface. Next, bone and cartilage meshes were centred using the mean of the bone points and were normalized using maximum radial distance so both tissues lie within a unit sphere. Then, separately for the bone and cartilage surfaces, 500,000 points were sampled. Ninety percent of points were randomly sampled by first sampling positions on the surface using blue noise to produce uniform random samples. Then, sampled surface points were perturbed by adding zero mean Gaussian noise: 45% *σ* = 0.016; 45% *σ* = 0.05. The remaining 10% of points were uniformly sampled over the unit cube. Finally, *s* from both meshes was calculated for every sampled point.

#### b) Training

Prior to training, each bone/cartilage pair was assigned a random *z* ∼ 𝒩 (0, 0.01^{2}). During training, for each subject (*k*) and surface type (*j*: bone/cartilage), 17,000 points (*X*_{jk}) were randomly sampled with equal numbers of points inside (-) and outside (+) the surface. Eqn. (1) was optimized to minimize the error in predicted *s* and to regularize the latent *z*. The loss comprises a reconstruction and latent regularization term. The reconstruction term penalizes hard samples (predicted wrong sign) as shown in Eqn (2) and includes a weighted ℒ_{1} where *λ* (0-1) controls the weighting on hard samples with *λ* = 0 being equivalent to regular ℒ_{1} and higher values provide greater penalty [20]. *λ* was exponentially increased from 0 to 0.2 over the first 1800 epochs. A latent regularization loss independently penalized each *z* component with *σ* = 100 to promote diagonal covariance. Latents and network weights *f*_{θ} were jointly optimized using the AdamW optimizer with weight decay of 1e-4 [61].

#### c) Reconstruction

To reconstruct surfaces and create shape-specific latents, the NSM weights were frozen and the NSM was fit to the new surfaces. Specifically, the bone to be reconstructed was similarity registered to the mean bone shape of the NSM (zero-vector) and the bone/cartilage surfaces were scaled to be within a unit sphere. Then, a randomly initialized latent *z* ∼ 𝒩 (0, 0.01^{2}) was optimized for 2,000 epochs to reconstruct the surfaces using an ℒ_{1} loss between the network predicted signed distance *s* and the actual *s* of 20,000 randomly sampled surface points (*s* = 0) using the Adam optimizer. The lr was decayed by a factor of 0.9 every 20 epochs, and early stopping was implemented with a patience of 50 epochs.

#### d) Hybrid Explicit Implicit NSM

The hybrid NSM is based on triplanar architectures [25], [26] as outlined in Fig. 4. A global latent *z* of a length of 512 is processed via a fully connected layer, resulting in a 2048-length vector. This vector is then reshaped to be 2 ×2 ×512 before being input into a CNN decoder. The CNN decoder had 5 2D transpose convolution layers, with stride 2 and 512 channels as outputs at each layer. The final output layer of the CNN was sized 64 ×64 ×384; the 384 features maps were split into 128 features per orthogonal plane. Sampled points *x* ∈ ℝ ^{3} are projected onto the three orthogonal planes, and a length 128 *z* was obtained per feature plane via bilinear interpolation. Plane features were combined via summation, yielding a length 128 local *z*. The local *z* and the sampled *x* position were concatenated and input into the implicit 3-layer MLP with width 512, *ReLU* activations, and a length two output (one for each tissue) with a tanh activation.

#### e). Implicit NSM Network

The implicit decoder was an 8-layer MLP of width 512, with a skip connection of the inputs (*x* and *z*) to layer 4 and *ReLU* activations throughout. The output was sized two and used the tanh activation.

### B. Statistical Shape Model

The SSM was fit using [46], the same as described in previous investigations [35], [14]. SSM-based reconstruction does not provide explicit cartilage surfaces but instead computes thicknesses at each bone vertex, therefore ASSD was not evaluated for SSM cartilage.

### C. Convolutional Neural Network

We trained two DenseNet121 models as implemented in the MONAI package [62]. One network was trained with an input of the raw DESS MRI data and the other an input of the bone/cartilage segmentations. For both variants, the 3D volumes used for input were downsampled from the original volumes (384 × 384 × 160) to be sized 384 × 384 × 80, using bilinear interpolation. This approach preserved full-resolution data in-plane, while reducing slice thickness to 1.4mm, which is sufficient for clinical trials including quantitative cartilage analyses [44]. CNNs were trained with the AdamW optimizer, an initial learning rate of 10^{−5} exponential decay with gamme=0.8 and weight decay=0. Training was performed with a single Nvidia A6000 GPU.

## V. Experiments

### A. Reconstructions

Reconstruction evaluations are provided for the SSM, implicit NSM, and hybrid NSM. No reconstruction results are provided for the CNN because it is not generative.

#### a) Dataset Size

To determine data efficiency, we trained each shape model using 4 training set sizes: 50, 200, 1,000, 6,325. NSMs were trained for 2,000 epochs (Tab. IV). SSMs were tested using progressively more principal components (Tab. IV). These analyses identified that: a) The hybrid NSM performed best for ASSD and both cartilage biomarker measures across dataset sizes, b) Increasing dataset size up to 6,325 increased reconstruction performance for all models, and c) Increasing the number of PCs used in SSM reconstruction did not overfit up to 1,298 PCs (99% explained variance)

The hybrid NSM better reconstructed areas of OA disease (Fig. 3). Fig. 5 distributions of ASSDs in the test set demonstrate that the hybrid NSM had better ASSD for bone (6-17%) and cartilage (9%). Tab. II shows that when assessed for all data, as well as by KL grade, the hybrid NSM had the lowest errors for reconstruction and cartilage biomarkers. Better SDD compared to RMSE indicates that all models had a small bias compared to the reference standard (Tab. II).

#### b) Latent Size

We tested the effect of doubling latent size on ASSD errors for the hybrid and implicit NSM models. Reconstruction accuracy improved as latent size increased, with the hybrid NSM ASSD dropping 26% and 22% for bone and cartilage, respectively (Tab. IV).

### B. Classification / Staging

An MLP was trained to predict each clinical evaluation task using each model’s encoded *z* as input. Hyperparameters were determined via a grid search over depth (2,3), width (64-256), dropout (0.2, 0.4), learning rate (10^{−3} to 10^{−5}), and batchsize (64-512). We also trained two 3D CNNs for clinical prediction tasks Sec. IV-C. Loss functions for CNNs and MLPs included binary cross entropy (OA, MOAKS cartilage thinning and hole, future OA and knee replacement) and consistent rank logits ordinal regression (KL, MOAKS osteophytes) [63].

#### a) OA Staging & Diagnosis

For predicting KL, the resulting *κ* of the trained models was 0.69-0.79, with the hybrid NSM having the best performance and the implicit NSM having the worst Tab. III. All models performed comparably to inter-radiologist agreement (0.66-0.89)[59], [64], [65], [60]. Prior X-ray based DL methods performed slightly better (0.83-0.88) [58], [60].

When directly diagnosing OA, the hybrid NSM performed best (AUROC: 0.92) and the implicit NSM performed worst (Tab. III), similar to the KL task. Interestingly, the CNN applied to the segmentation and the image performed the same, indicating the raw MRI provides no additional information. Accuracy was slightly lower (0.81-0.83) than DL-based X-ray OA grading (0.87-0.90) [66], [60], likely because X-rays are the original data used to grade KL. However, the 2D X-ray projection of the joint is prone to positioning errors [7] and thus it is possible that 3D analyses are closer to the ground truth physiologic (not image-based) grading. Our CNN predictions were comparable to a previous CNN applied to MRI data for predicting OA [67].

#### b) Advanced OA staging

The hybrid NSM performed best for all three MOAKS tasks when averaged over the regions (Tab. III). These results indicate that the latent *z* fit by the NSM more meaningfully represented both the location and the size of OA features. Not only is this important for OA, but it demonstrates novel capacities of NSMs that are not commonly tested; the ShapeMed-Knee dataset provides a unique method of testing these capacities using real-world data.

The CNN models performed poorly in identifying cartilage holes (F1: 0.00-0.03) and were no better than chance for the MOAKS osteophyte tasks (*κ* 0-0.04) Tab. III. Prior DL work uses MOAKS to determine severity of cartilage damage [55]. Other work predicts other features of MOAKS, bone bruises [68] or inflammation [69]. This is the first quantification of MOAKS osteophyte and cartilage health, demonstrating that NSMs encode this important information that is currently prohibitive to obtain clinically, and costly for research and clinical trials.

#### c) Future OA & knee replacement prediction

All models performed poorly on future event prediction tasks (Tab. III), despite prior SSM bone shape work showing links between current shape and future disease [15], [31]. However, these prior studies used odds ratios to determine if certain shapes are more likely to get worse, and did not always use a test set [15]. The best-performing future OA diagnosis was by the raw image-based CNN (AUPRC: 0.20, F1: 0.26); it is possible non-shape-related features such as bone bruises or joint inflammation boosted CNN image performance [70].

### C. Interpretability

One of the powers of shape models is that they are fit in a self-supervised fashion, and are generative. To show the utility of this, we trained a logistic regression classifier on hybrid NSM *z* for each prediction task. Results in Tab. III show that the simple classifier is one of the best for disease staging. We tested latent interpolation smoothness by assessing the effect of interpolation on reconstructions and disease prediction. Using the hybrid NSM we interpolated *z* from the mean healthy (KL 0) to the mean severe OA (KL 4) shapes in the test set, generated synthetic surfaces, and applied the logistic classifiers on each *z* to determine KL and MOAKS cartilage thinning grades Fig. 6. Shape space interpolation generated smooth physical interpolations and predicted smooth transitions of disease states Fig. 6. This general-purpose representation is powerful because application to other image modalities only requires a segmentation mask, whereas CNN-based approaches would require re-training on entirely new datasets. Furthermore, interpolation could be used to track individual patient disease trajectories over time, opening the door to novel ways of understanding disease.

The generative nature of the NSM enables further validation that classifiers applied to the latent *z* are capturing features of interest. Fig. 7 takes the latent *z* fitted to a patient, and interpolates it along the vector defined by a logistic regression classifier that predicts medial cartilage holes. Simple linear interpolation along the fitted vector precisely controls the size of the cartilage hole on the medial side. This visualization improves confidence in the fitted model, but may also enable entirely new applications. For example, it is possible to precisely add and remove specific, localized, features of disease and therefore to generate synthetic versions of a patient’s anatomy. These synthetic digital twins can be used for *in silico* simulations to determine the effects of specific disease features on tissue biomechanics [12], or to inform surgical planning such as cartilage repair [71], [72]. Importantly, this example uses simple linear interpolation; future work can leverage latent diffusion models [73] to advance this capacity.

## VI. Conclusion

We contribute a hybrid explicit-implicit NSM which demonstrates state-of-the-art performance for anatomic reconstruction, and clinical outcome prediction. Model training and evaluation were enabled by our new ShapeMed-Knee dataset. All shape models were capable of simple OA staging. Hybrid NSMs uniquely quantified the location and size of OA features. While hybrid NSMs provide current state-of-the-art bone and cartilage reconstruction, further advances applied to our ShapeMed-Knee dataset have the potential to improve results and, in turn, our understanding of OA. We encourage the community to leverage ShapeMed-Knee data and benchmarks to tackle the unique challenges presented by modeling multiple anatomic surfaces and encoding meaningful disease-specific information.

## Data Availability

All data produced are available online at https://huggingface.co/datasets/aagatti/ShapeMedKnee

## APPENDIX

### A. Mesh Processing

Surface meshes were created for each subject using established methodologies and open source tools [46], [74], [75], [76] outlined in the following. First, binary segmentation masks for each tissue were Gaussian filtered (*s*: bone=0.5mm, cartilage=0.1mm), and surfaces were extracted using a continuous marching contours algorithm at a threshold of 0.5 [77]; marching cubes was applied using the VTK implementation [74] via the pyMSKT library [46]. To ensure meshes were watertight, the Robust Watertight Manifold Surface Generation method [78] implemented in Point Cloud Utils [76] was used with a mesh resolution of 200,000. The watertight meshes were slightly dilated compared to the original mesh, thus each point of the manifold mesh was projected back onto the original surface to preserve the original topology. Once watertight meshes were created, they were decimated to have 50% of the vertices, resulting in ∼ 250, 000 vertices per bone mesh, and ∼ 150, 000 vertices per cartilage mesh, again using the Point Cloud Utils library [76]. These methods have been commonly used for analyses of bone and cartilage surfaces in osteoarthritis (OA) [35], [51]. Segmentations, generated surface meshes, and all code for generating the surfaces from the segmentations will be publicly shared.

### B. Cartilage Biomarker Calculation

Cartilage biomarker generation was done using standard definitions of anatomic regions of interest established in the literature [79], [48]. Thickness calculations were performed using a normal vector method that is well established in the literature and is comparable to other approaches (e.g., field lines, nearest neighbour) [80]. Processing was done in three steps:

#### 1) Divide cartilage into subregions

Cartilage is divided into 5 subregions (Anterior, medial and lateral central, and medial and lateral posterior) [79], [48]. See Fig. 2 for visualization of the 5 cartilage regions. To divide the cartilage, we identified three anatomic points along image axes: anterior-posterior (forward-backward) *x*, inferior-superior (up-down) *y*, medial-lateral (side-to-side) *z*. The three points are:

The trochlear notch point along the

*x*axis*x*_{t}was determined by flattening the segmentation along the y-axis, flood filling, and then performing an iterative search for the most anterior (negative*x*) point of the posterior border of the cartilage between the medial and lateral femoral condyles.The most posterior (backward; positive

*x*axis) bone point in the*x*axis*x*_{pb}is the femoral bone voxel with the most positive position.The center of the medial and lateral tibial cartilage

*c*along the*z*axis*z*_{c}.

These three points were then used to divide femoral cartilage into 5 subregions, creating a femoral cartilage subregion mask. i) Anterior cartilage is any point where *x* < *x*_{t}, ii) medial weight-bearing cartilage is any point where *x*_{t} < *x* < *x*_{pb} ∧ *z* > *z*_{c}, iii) lateral weight-bearing cartilage is any point where *x*_{t} < *x* < *x*_{pb} ∧ *z* < *z*_{c}, iv) medial posterior cartilage is any point where *x* > *x*_{pb} ∧ *z* > *z*_{c}, and v) lateral posterior cartilage is any point where *x* > *x*_{pb} ∧ *z* < *z*_{c}. [80], [79], [48]

#### 2) Compute vertex-wise cartilage thickness

Cartilage thickness was assigned to each bone vertex by projecting a vector normal to the surface and if the vector intersected the cartilage mesh twice, then the Euclidean norm of the intersections was calculated as the cartilage thickness and assigned to the respective bone vertex. Otherwise, a thickness of 0 was assigned.

#### 3) Assign vertex subregions

Bone vertices with non-zero cartilage thickness values were assigned a cartilage region label. Vectors were again projected normal to the surface and the 3D cartilage subregion mask was probed 100 times along this vector between the two cartilage intersections. The bone vertex was assigned the label with the most frequent cartilage subregion determined from the probe.

### C) Mesh Quality Control

We generated static images from 4 orthogonal views (top/superior, bottom/inferior, front/anterior, back/posterior) of the registered versions of every bone mesh (Sec. III-A.1 *Bone Surface Registration*) in the dataset and manually reviewed them. This process identified 57 bones with obvious and large errors, Fig. A1 includes examples of the static images used for quality control as well as of an excluded example.

#### 1) Registration artifacts

These quality control static images highlight errors or artifacts that can occur as a result of non-rigid registration. While the original downsampled meshes (20,000 vertices) used for shape modeling are normally clean with regularly spaced triangles, non-rigid registration can introduce artifacts when there is discordant anatomy between two surfaces. For example, Fig. A2 shows an example of a mesh with low-resolution on the mesh surface at diseased regions of the bone surface, this is likely at least in part to explain why the SSM-based approaches do not perform as well for reconstruction compared to the NSM-based approach. Another example Fig. A3 shows abnormal surface triangles in a generally smooth surface region, this was a rare artifact but highlights how registration-based approaches can fail leading to sub optimal results.

### D. Prediction Tasks Generation

During creation of the dataset, MRI Osteoarthritis Knee Score (MOAKS) grades were converted from 4 labels to 3 (osteophyte) or 2 (cartilage thinning, cartilage hole) labels due to data imbalances. Fig. A4 shows histograms of the MOAKS scores (osteophyte, cartilage thinning, cartilage hole) before and after reclassification.

The benchmark models used in this study included three shape models (implicit neural shape model (NSM), hybrid NSM, and a statistical shape model (SSM)) as well as two convolutional neural network (CNN) based approaches. Additional implementation and training details for each are provided below.

### E. Neural Shape Model

The NSM model was built using PyTorch. Training and inference were performed on a single graphics processing unit, either a Nvidia A6000 or 2080ti.

#### a) Training

During training, we used separate learning rates and schedules for the network weights and the latents *z*. For both sets of parameters, learning rate (lr) was decayed as *lr* = *lr*_{0} ×*f* ^{(e/i)} where *lr*_{0} is the lr at time zero, *f* is the update factor *e* is the current epoch, and *i* is the interval which *lr* is updated. Network weights had parameters of *lr* = 5× 10^{−3}, *f* = (1*/*1.05), *i* = 16.67 and latents *z* had parameters of *lr* = 10^{−4}, *f* = 0.1, *i* = 1000.

The latent regularization weight from Eq. (1) was (1*/s*^{2} = 10^{−4}; *σ* = 100. The weight had a linear warmup over the first 100 epochs and was then cyclically annealed with 5 cycles over the training period (2,000 epochs). The cyclic anneal weight *β* for each cycle was defined using Eq. (A1) where *t* is the epoch for the current cycle and *T* is the number of epochs in each cycle (2000*/*5). We clamped signed distances *s* at |*s*| = 0.1 for the implicit NSM and |*s*| = 1 for the hybrid NSM.

#### b) Reconstruction

First, the bone/cartilage pair to be reconstructed were similarity registered to the mean bone shape of the NSM and then the bone/cartilage surfaces were scaled to be within a unit sphere, the same as during training Sec. IV-A. Then, the network weights were frozen and a randomly initialized latent *z* was optimized to reconstruct the shape. Specifically, for each iteration, 20,000 points were randomly sampled from the surface, and an ℒ_{1} loss between the network predicted signed distance *s* and the actual distance (0) was optimized using the Adam optimizer. The lr was decayed by a factor of 0.9 every 20 epochs. Optimization was performed for a maximum of 2,000 epochs with a patience of 50 epochs. Predicted *s* were clamped at |*s*| = 0.1. No latent regularization was used.

#### c) Hybrid NSM

The hybrid NSM is similar to previous work [25], [26]. The overall architecture is described in Fig. 4. A global latent *z* was passed through a multilayer perceptron (MLP) yielding a vector of length 2048 that was reshaped to 2 ×2 ×512 using a fully connected layer and input into a CNN. Our CNN had 5 2D transpose convolution layers, with stride 2 and 512 channels as outputs. The final output layer of the CNN was sized 64 × 64 × 384; features maps were split into 128 per plane. For a given sampled point, a length 128 *z* was obtained per plane via bilinear interpolation and the *z*s were combined via summation. The local *z* and the sampled points xyz position were concatenated and input into a 3-layer MLP with width 512, *ReLU* activations, and a length two output with a tanh activation.

### F. Statistical Shape Model

Registered meshes described in paragraph III-A.1.b were used to fit the model using the available training data by creating a *M* ×*N* matrix where *M* is the number of training examples and *N* is the number of columns/features. In our case, *N* = (3 + 1) ×20, 000 = 80, 000 where 20,000 is the number of vertices, 3 represents XYZ dimensions, and 1 is for the cartilage thickness features stored at each vertex. Principal component analysis (PCA) was applied to this matrix to obtain eigenvectors *v* and eigenvalues *λ*. Normalized principal component (PC) scores for training and testing data were obtained by projecting registered points onto *v* and normalizing by . Bone surfaces, and coinciding cartilage thickness per vertex, were reconstructed by · *v* where *k* is the number of PCs used in the reconstruction.

### G. Convolutional Neural Network

We trained a 3D DenseNet121 CNN on two distinct types of medical imaging data: raw DICOM images and segmentation masks. This approach allowed us to evaluate the significance of pixel magnitude information.

We used an input size of 384 ×384 ×80 and output channels being equal to the product of the number of tasks and classes per task. To accommodate our requisite prediction tasks, six model variants were devised one for each of the following tasks: OA Staging, OA Diagnosis, Future OA Incidence, Future knee replacement, MOAKS Ostephytes, and finally MOAKS Cartilage Thinning and MOAKS Cartilage Hole as a single model.

Image pre-processing included normalization of each individual image to have a mean of 0 and a standard deviation of 1 as well as standardization of image orientation. so that the first dimension extends from left to right, the second from posterior to anterior, and the third from inferior to superior. Inputs were then resampled to a uniform spacing of 0.3645 mm inplane and 1.4 mm out-of-plane using bilinear interpolation. Subsequently, the images were center-cropped and padded to dimensions of 384×384 in-plane and 80 out-of-plane.

To optimize the CNNs, we used the AdamW optimizer, with initial *lr* = 10^{−5}, exponential decay with gamma=0.8, and weight decay of 0. Training used the same loss functions as the shape-model MLPs Tab. A1 and a batch size of 8. Trainnig was performed on a single Nvidia A6000 GPU.

### H. Reconstruction

#### a). Cartilage Biomarkers

Cartilage biomarkers computed from reconstructions were compared to the original mesh calculations. Absolute error was determined using root mean squared error (RMSE) for all regions, and separately for each KL grade Fig. A5. Similarly, the standard deviation of the difference (SDD) was calculated by disease category and regions Fig. A6; the SDD provides a measure of consistency that accounts for bias between the methods. Consistently better SDD compared to RMSE indicates that each method had a small bias compared to the reference segmentation method, this is confirmed by visualizing the distributions of errors between the ground truth and the reconstructed surfaces Fig. A7. It is also apparent that different regions and models had different biases. For example, in the anterior femur, the NSMs underestimated thickness, whereas the SSM overestimated thickness.

During training, larger datasets, larger latent sizes, and hybrid (vs. implicit) architectures improved reconstructions. Specifically, bone and cartilage surface reconstruction accuracies were measured using ASSD, and cartilage biomarker reconstructions were compared to the reference using RMSE and SDD Fig. A8. These figures showed that increasing dataset size increased performance. Increasing the latent size (black lines) considerably improved reconstruction performance.

Reconstruction with the SSM using progressively more PCs improved reconstruction of surface ASSD (Fig. A9), and cartilage biomarkers (Fig. A10) up to the the number of components needed to explain 99% of the variance in the original dataset. When plotted with the x-axis as the percent explained variance, it appears as though ASSD is exponentially improving even at 99% explained variance. However, the number of PCs required for a given increase in explained variance is increasing exponentially. When ASSD is plotted as function of the number of PCs it is apparent that ASSD improvement is actually exponentially decaying Fig. A9.

#### b). Reconstructing Erroneus Surfaces

To determine how a trained NSM reconstructs a mesh with artifacts, we used the fitted hybrid NSM to reconstruct the surface of one of the 57 meshes excluded due to obvious errors. Fig. A11 shows surfaces of the original erroneous surface, the reconstructed surface, and an overlay of both surfaces on the original MRI. These data demonstrate that the NSM faithfully reconstructs plausible anatomical surfaces while filling in the corrupted regions with reconstructions based on the learned priors. This finding supports previous research which indicates that NSMs can be used as a means of refining automated segmentations to ensure anatomic plausibility [38].

### I. Prediction Tasks

MLP hyperparameters were determined via a grid search over depth (2,3), width (64-256), dropout (0.2, 0.4), lr (10^{−3} to 10^{−5}), and batchsize (64-512). The two future prediction tasks (knee replacement, osteoarthritis) were trained with a class imbalance weight. MLP hyperparameters were determined separately for each task, but the same parameters were used across shape models (SSM, hybrid NSM and implicit NSM). Specific parameters are described in Tab. A1. Different tasks leveraged different loss functions. OA diagnosis, MOAKS cartilage thinning, and MOAKS cartilage hole all used binary cross entropy. Future knee replacement (KR) and future OA diagnosis both used binary cross entropy with the positive class (smallest) weighted inversely proportional to its relative occurrence. KL grading and MOAKS osteophyte scores both used an ordinal regression loss [63].

#### a) Disease Staging

To visualize differences in KL grading between the methods, we plot confusion matrices for each of the shape and the CNN models Fig. A12. All models had good performance with quadratic weighted kappa *κ* ≥ 0.69.

#### b) Disease Diagnosis

All models performed well predicting OA with area under the receiver operating characteristic curve (AUROC) ≥ 0.88 Fig. A13.

#### c) Advanced Disease Diagnosis

MOAKS scores per region are provided in Figs. A14 to A16. The hybrid NSM performed best for all three MOAKS prediction tasks. Cartilage predictions performed best for the anterior lateral and the central medial regions, which are also common locations of cartilage deterioration in OA Figs. A15 and A16.

#### d) Future Prediction

All models performed poorly for future OA disease prediction and performed better for future knee replacement prediction Fig. A17. The hybrid NSM performed best for future knee replacement risk, and the CNN-image performed best for future OA prediction. Overall knee deformity is a decision factor of whether a patient receives knee replacement; the hybrid NSM, which better captured and localized shape features of OA disease Figs. A14 and A16 may have also better represented features common in individuals who undergo knee replacement. Better future disease prediction by the CNN on raw image data may be explained by it leveraging features other than shape, such as bone bruises or general joint inflammation Fig. A17.

### J. Interpretability

Interpolation in latent space from the mean healthy (KL 0) to the mean severe OA (KL 4) knees showed progressive increases in disease-specific features of cartilage Fig. 6, as well as osteophytes Figs. A18 and A19. Not only did the bones show a progressive increase in osteophyte sizes, but when the logistic classifiers were used they also predicted progressively higher osteophyte scores Figs. A18 and A19. As shown in all of the interpolation figures Figs. 6, A18 and A19, the first bone is correctly classified as KL = 0, however, the last bone is misclassified as KL = 3. Osteoarthritis is a heterogeneous disease with many different presentations leading to the same severity score (e.g., KL 4). It is likely that in this example the average of all subjects with KL 4 reduced extremes of any one presentation of OA thus resulting in moderate severity in most features and leading to an overall shape that was classified as KL 3 instead of KL 4.

## Footnotes

This work was supported in part by the National Institutes of Health R01 AR077604, R01 EB002524, R01 AR079431, P41 EB027060, the Wu Tsai Human Performance Alliance, and a CIHR Postdoctoral Fellowship.

(email: akshaysc{at}stanford.edu)

## References

- [1].
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵