1 Introduction

Dynamic susceptibility contrast-enhanced (DSC) magnetic resonance imaging (MRI) is an essential tool to assess perfusion deficits in acute ischemic stroke [6]. Given a perfusion sequence, traditional methods rely on an estimation of the arterial input function (AIF) and on a parametric model to generate perfusion maps. Automatic estimation of the AIF tends to be non-robust, whereas a manual selection of the AIF is impractical. Besides, the computation of perfusion maps costs time in the range of minutes, in situations where time is often critical.

In 2017, Song et al. [7] introduced temporal similarity perfusion (TSP) maps, together with a model-free, iterative process for their generation. The generated maps are compared to traditional time-to-peak (TTP) and mean-transit-time (MTT) maps to assess their clinical value for the detection of perfusion deficits. The proposed method operates without the need for AIFs, but it is fixed to the generation of TSP maps. Machine learning has been used extensively to post-process perfusion maps, e.g., to estimate tissue at risk (penumbra) [4]. Meanwhile, in McKinley et al. [5] regression of perfusion maps from raw DSC-MRI perfusion data in the presence of an externally provided AIF was demonstrated. Fully automatic end-to-end regression of DSC perfusion maps has not been approached so far.

In this paper, we present a model-free, convolutional neural network (CNN) based architecture to predict arbitrary perfusion maps in an end-to-end manner. There have already been applications of CNNs on dynamic contrast-enhanced (DCE) perfusion sequences [8]. However, our work is the first application on DSC-MRI perfusion sequences to the best of our knowledge.

2 Methods

In this section, we describe the proposed neural network architecture for predicting synthetic perfusion maps. Since the architectures we developed for predicting different types of perfusion maps are similar (see supplementary material, Fig. S1), we solely focus on the one used for predicting \(T_{max}\) perfusion maps.

2.1 Pre-processing and Augmentation

We pre-process both the raw perfusion data and the target perfusion maps. Each volume is padded with zeros to match the maximum size of any volume in the training dataset. Additionally, the perfusion sequence is complemented with volumes at the end to match the maximum sequence length of any perfusion sequence in the training dataset. For this, we reflect the data to generate frames for padding instead of using zero-filled volumes. The resulting perfusion maps are of size \(24 \times 256 \times 256\), the perfusion sequences of size \(80 \times 24 \times 256 \times 256\), where 80 is the number of frames and 24 the number of image slices.

After padding, we perform data standardization, i.e., we transform the data, so it has zero mean and unit variance. Finally, we apply voxel-wise temporal Gaussian smoothing for the perfusion sequence with \(\sigma _{t} = 1.0\).

From inspecting the training portion of our dataset, we conclude that the time of perfusion sequence start is not in a global relation to the time of bolus arrival in the brain, i.e., there are cases where the bolus arrives much earlier or later than in most other cases. Even though such cases occur rarely, we want our model to be able to handle arbitrary delays of the bolus. To prevent the model from learning a global bolus arrival time, we augment the training dataset by randomly offsetting the perfusion sequence by \(-5\) to 30 frames. A negative number means we remove frames at the beginning of the sequence and add padding frames at the end, a positive number indicates the opposite. The padding is generated via reflecting.

Fig. 1.
figure 1

An overview of the perfusion map prediction model.

2.2 Deep Architecture for Perfusion Map Regression

The goal is to predict voxel-values in a target perfusion map \(P\) based on the raw perfusion sequence \(S\). Our crucial assumption is that a voxel-value \(p_{x,y,z}\) in perfusion map \(P\) mostly depends on the sequence of voxels at the same location in the raw perfusion sequence, i.e., on \(\mathbf {s}_{1:T,x,y,z}\), where T indicates the total number of frames in the perfusion sequence. There are obvious limitations to this assumption, which will be discussed in Sect. 4. Based on this assumption, we use a CNN to capture the temporal evolution of the raw perfusion sequence voxel-wise. Figure 1 shows an overview of our architecture. The individual substructures will be briefly explained in the following sections.

Bolus Characterization. Our model has neither knowledge of the arterial input function nor the time of bolus arrival in the brain. Ignoring these aspects would significantly impair the performance of our model. Therefore, we add a bolus characterization structure (BCS) which should help capture the time of bolus arrival in the brain as well as the AIF. Guided by the fact that those characteristics are captured best by large blood vessels entering the brain, we select the input to the BCS to be a patch sequence from the perfusion sequence, located at the transition between the basilar artery and the posterior cerebral artery. The location of this patch is globally fixed, i.e., it is not fine-tuned to the individual volume. Therefore, it may happen that this patch does not contain the desired blood vessels for specific instances in our data. The BCS processes the supplied patch sequence via two 3D convolutional layers, encoding each patch into a vector of size 16. The sequence of encoded patches is forwarded to the sequence encoder.

Sequence Encoding. The sequence encoder handles every voxel sequence \(\mathbf {s}_{1:T,x,y,z}\) independently, together with the additional information supplied by the bolus characterization structure. Note that this information is the same for all \(\mathbf {s}_{1:T,x,y,z}\) of a perfusion sequence \(S\). For simplicity, we will describe how the sequence encoder handles one individual voxel sequence. In reality, the sequence encoder processes multiple voxel sequences concurrently.

Effectively, the sequence encoder works on three inputs: a sequence of voxel-values \(\mathbf {s}_{1:T,x,y,z}\), the sequence of frame times \(\in \mathbb {R}^{80}\), and the sequence of encoded patches from the bolus characterization structure. These sequences are concatenated along their non-temporal dimension, resulting again in a sequence of length 80. This result is passed through three 1D convolutional layers, of which the first two are followed by a max-pooling layer. The output is a vector of size 256, capturing the evolution over time of one voxel in the perfusion sequence.

Fig. 2.
figure 2

The architecture we used for local spatial correlation. The dense layer only operates on the last dimension of the data. Note that the shape of the data passed from the sequence encoder to the 2D convolution stems from the fact that we process the volumes in patches of \(1 \times 32 \times 256\) and that each voxel sequence \(\mathbf {s}_{1:T,x,y,z}\) is encoded into a vector of size 256. The first dimension of the data is redundant and is only depicted for clarity. Selu denotes the activation function used [3].

Spatial Correlation. So far, there is no flow of information between neighboring voxels, i.e., each voxel sequence \(\mathbf {s}_{1:T,x,y,z}\) is handled independently. Due to the low spatial resolution of the volumes along the axial axis, we omit spatial correlation between slices. To allow for some learned local filtering within slices, we apply 2D convolution to the voxel-wise output of the sequence encoder. Figure 2 illustrates this process.

Regression. Given the spatially correlated encoding of voxel sequence \(\mathbf {s}_{1:T,x,y,z}\), the prediction of the perfusion map voxel-value \(\hat{p}_{x,y,z}\) and the estimated uncertainty \(\hat{b}_{x,y,z}\) is performed via a fully connected layer with two output neurons and identity activation.

Loss. The training objective is based on heteroscedastic aleatoric uncertainty modeling [1], i.e., instead of a single prediction, the model outputs a mean and a measure of uncertainty per voxel. This effectively corresponds to a probability distribution per voxel. The loss is given by the negative log-likelihood of the observed target map data. We choose a Laplace distribution to assign probabilities to observed values since a Gaussian distribution is too light tailed to cope with the amount of noise in target perfusion maps. Equation 1 formally defines the negative log-likelihood \(l'\), given \(p_{x,y,z}\), \(\hat{p}_{x,y,z}\) and estimated uncertainty parameter \(\hat{b}_{x,y,z}\).

$$\begin{aligned} l' \left( p_{x,y,z}, \hat{p}_{x,y,z}, \hat{b}_{x,y,z}\right)&= \log \hat{b}_{x,y,z}+ \frac{\left| p_{x,y,z}- \hat{p}_{x,y,z}\right| }{\hat{b}_{x,y,z}} \end{aligned}$$
(1)

To be able to further focus the networks efforts onto voxel-values of high clinical importance and to simultaneously reduce the influence of noise, we apply a weighting scheme to the negative log-likelihood \(l'\) of each voxel. The weighting scheme is based on a predefined function I that assigns an importance to each value in \(P\) as shown in Eq. 2. The weight for a pair \(\left( p_{x,y,z}, \hat{p}_{x,y,z}\right) \) is given by the function W as shown in Eq. 3. The final voxel-wise loss l is given by Eq. 4. We use an average to compute the loss over multiple voxels.

$$\begin{aligned} I(z)&= \left\{ \begin{array}{ll} 1.0, &{} \text { if } 0.0 \le z \le 40.0\\ 0.1, &{} \text { else} \end{array}\right. \end{aligned}$$
(2)
$$\begin{aligned} W \left( p_{x,y,z}, \hat{p}_{x,y,z}\right)&= \max _{z=\min (p_{x,y,z}, \hat{p}_{x,y,z})}^{\max (p_{x,y,z}, \hat{p}_{x,y,z})} I(z) \end{aligned}$$
(3)
$$\begin{aligned} l \left( p_{x,y,z}, \hat{p}_{x,y,z}, \hat{b}_{x,y,z}\right)&= l' \left( p_{x,y,z}, \hat{p}_{x,y,z}, \hat{b}_{x,y,z}\right) \cdot W \left( p_{x,y,z}, \hat{p}_{x,y,z}\right) \end{aligned}$$
(4)

Training. The perfusion maps are processed in patches of size \(1 \times 32 \times 256\), the perfusion sequences in patches of size \(80 \times 1 \times 32 \times 256\). These patches are gathered in batches of size four before being passed to the network. For optimization, we use Adam [2] with an initial learning rate of \(5e^{-4}\). The learning rate is divided by two every four epochs. We rely on dropout regularization with a dropout rate of 0.5 for fully connected layers and do not use \(l_2\)-norm weight decay.

3 Experimental Setup and Results

The model was trained and evaluated on DSC-MRI perfusion data of patients with acute ischemic stroke. For a given perfusion sequence, the target perfusion maps were generated using oscillation index singular value decomposition in Olea Sphere\(^{\textregistered }\) 2.3 with default settings as used in clinical routine. The complete dataset contains 189 cases, of which we excluded 38 because the detected AIF was inaccurate, leaving us with a dataset of 151 cases. Approval for this retrospective study was obtained from the local ethics committee (KEK Bern, Switzerland, approval number: 231/14). Written informed consent was waived according to the retrospective nature of this analysis.

We randomly partitioned the dataset into training set, validation set and test set with ratios of 0.5, 0.2 and 0.3 respectively. The training set was used for optimizing the network’s weights, the validation set for model selection, and the test set for evaluating the final model. The selected hyperparameters are listed in Table S1 (see supplementary material).

The neural network was trained end-to-end, using the pre-processed perfusion sequences as input and the pre-processed perfusion maps from Olea as target maps. Training was performed on a Windows machine with an Intel Xeon E5-1630 v3 @ 3.7 GHz and a Nvidia GeForce GTX 1080, and on an Ubuntu machine with an Intel i7-4790K CPU @ 4.00 GHz and two Nvidia GeForce GTX 1070. All models were trained for 30 epochs, the selection of the final model was based on its performance on the validation set. The forward propagation took time in the range of seconds to compute a complete perfusion map.

Table 1. The MAEC of the predicted \(T_{max}\) maps for different models. The lines below model A show the performance of models where the listed component was removed.

3.1 Quantitative Results

To quantitatively evaluate and compare different models, we used a mean absolute error with clipping (MAEC) as performance measure. It is identical to a mean absolute error, except that voxel-values are clipped to the interval [0, 20] before computing differences. The clipping was done since values below 0 mostly correspond to air and the ones above 20 definitely indicate a perfusion deficit or are part of the noise. The exact values 0 and 20 were chosen because [0, 20] is a reasonable window for inspecting \(T_{max}\) maps.

Fig. 3.
figure 3

The influence on the MAEC for different models when shifting the raw perfusion sequence by a given number of frames.

The model described in Sect. 2.2 is referred to as model A in the remainder of this section. Table 1 shows the performance of model A and compares it to variants of model A where critical components introduced in Sect. 2.2 were removed. While model A performed best on the validation set, model B performed best on the test set. When manually inspecting the perfusion sequences of the validation and test set, we observed that the time of contrast bolus arrival varies more in the validation set than it does in the test set, with a standard deviation of 2.76 frames in contrast to 2.09 frames. Hence, the cases in the validation set did potentially benefit more from data augmentation via temporal shift. However, this temporal shift made training harder, which made the model perform slightly worse on cases where the bolus arrival delay was close to the mean delay. To further investigate the effectiveness of our augmentation, we measured the MAEC on the validation data for model A and model B after shifting the perfusion sequences by a number of frames. The results are shown in Fig. 3 and clearly indicate that the augmentation successfully helped the model to compensate for different times of bolus arrival.

Further, we observed that removing the spatial correlation, the bolus characterization structure or the loss weighting significantly decreased the model’s prediction accuracy on both the validation and the test set. It is evident that removing loss weighting increases the MAEC, since the loss weights assign high importance to the values that influence the MAEC.

3.2 Qualitative Results

Figure 4 shows the target map \(T_{max}\), the model’s prediction \(\hat{T}_{max}\) and the estimated variance of the prediction \(\hat{\sigma }^2\) for three samples from the test set.

Fig. 4.
figure 4

The target map \(T_{max}\), the predicted map \(\hat{T}_{max}\) and the estimated variance \(\hat{\sigma }^2\) of the prediction for three examples from the test set. Note that \(\hat{\sigma }^2\) is given by \(2\hat{b}^2\).

For the first two cases, the model’s prediction is very close to the target map. Compared to the target maps, the predictions tend to contain less high-valued noise and generally have a smoother appearance. We would have expected the model to predict high uncertainty where its prediction error is likely to be high, i.e., we would have hoped for a positive correlation between prediction error and \(\hat{\sigma }^2\). The \(\hat{\sigma }^2\) maps do not match those expectations. Instead, we observed a correlation between the predicted value and \(\hat{\sigma }^2\), meaning the prediction of correct high values seems to be hard for the model. This observation makes sense since there is a considerable amount of high-valued noise in the target maps.

The third row of Fig. 4 shows an example where our model failed. From observing the corresponding raw perfusion sequence and comparing it to perfusion sequences of examples where the model performed better, we noticed that the signal attenuation caused by the contrast agent is comparably weak for this case. Also, the signal is very noisy, partially due to slight head movements, which are amplified by the low axial resolution of the volumes. Given this additional information, we assumed that the bolus characterization structure was unable to correctly capture the bolus arrival in the brain, which led to a poor prediction.

4 Discussion and Conclusion

We made some simplifying assumptions in the presented approach, the most crucial one being that voxels in a perfusion map mainly depend on the perfusion sequence of voxels at the same location. This simplification does not always hold, especially when there was head movement during the sequence acquisition. An obvious solution to this is to register the individual volumes of the perfusion sequence before processing them any further. However, this is hard due to the low resolution of the volumes along the axial axis, which can lead to significant interpolation artifacts. Furthermore, it does not fit the concept of an end-to-end learning model. Another possible solution would be to make the sequence registration part of the model.

In conclusion, we presented a model-free, CNN-based method for inferring perfusion maps in an end-to-end manner. We demonstrated our method’s performance on an ischemic stroke dataset of 151 patients and have shown that the predictions are comparable to the target perfusion maps. We are currently working on a clinical evaluation of the synthetic perfusion maps in order to confirm the applicability of CNNs in real-world DSC-MRI perfusion imaging.