One very naive idea is to directly learn the score function via regression. Here a score-based model $s_{\theta}(x_t,t)$ is learned to approximate the score function. Any neural network that maps an input vector $\mathbf{x} \in \mathbb{R}^d$ to an output vector $\mathbf{y} \in \mathbb{R}^d$ can be used as a score-based model, as long as input and output have the same dimensionality. This yields huge flexibility in choosing model architectures. We can estimate the score by minimizing the Fisher divergence (https://arxiv.org/pdf/1905.07088.pdf), which is defined as:
$$
\min_{\theta}\mathbb{E}_{t \sim U(0,T)}\mathbb{E}_{x_t \sim q_t(x_t)}[||s_{\theta}(x_t,t) - \nabla_{x_t} \log q_t(x_t)||_2^2].\tag{1}
$$
The score-based model is trained by minimizing the above objective function. The first expectation can be approximated by sampling $t$ from a uniform distribution $U(0,T)$. For the second expectation, however, we would need to sample from the distribution $q_t(x_t)$ and evaluate $\nabla_{x_t} \log q_t(x_t)$. This score is not tractable, since we only have access to samples from the underlying data distribution.
We can use samples from the data distribution and diffuse individual data points with a variance-preserving SDE. Here $q_t(x_t|x_0)=\mathcal{N}(x_t;\gamma_t x_0, \sigma_t^2\mathbf{I})$, with $\gamma_t = \exp(-\frac{1}{2}\int_0^t\beta(s)ds)$ and $\sigma_t^2 = 1-\exp(-\int_0^t\beta(s)ds)$. Now we can sample from the data distribution $x_0 \sim p(x_0)$ and then sample from the diffusion process $x_t \sim q_t(x_t|x_0)$ using the normal distribution. The resulting conditional density is tractable, so we can compute the gradient of its log density in closed form. The score-based model $s_{\theta}(x_t,t)$ is trained by minimizing the following Denoising Score Matching objective (https://arxiv.org/abs/1907.05600, https://arxiv.org/abs/2011.13456):
$$
\min_{\theta}\mathbb{E}_{t \sim U(0,T)}\mathbb{E}_{x_0 \sim p(x_0)}\mathbb{E}_{x_t \sim q_t(x_t|x_0)}[||s_{\theta}(x_t,t) - \nabla_{x_t} \log q_t(x_t|x_0)||_2^2]. \tag{2}
$$
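As a concrete sketch, the perturbation kernel $q_t(x_t|x_0)$ above only needs a closed form for $\int_0^t\beta(s)ds$. The linear schedule $\beta(t)=\beta_{min}+(\beta_{max}-\beta_{min})t$ and the endpoint values below are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

# Assumed linear schedule beta(t) = beta_min + (beta_max - beta_min) * t.
BETA_MIN, BETA_MAX = 0.1, 20.0

def int_beta(t):
    # Closed-form integral of beta(s) from 0 to t for the linear schedule.
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2

def gamma(t):
    # gamma_t = exp(-0.5 * int_0^t beta(s) ds)
    return np.exp(-0.5 * int_beta(t))

def sigma2(t):
    # sigma_t^2 = 1 - exp(-int_0^t beta(s) ds)
    return 1.0 - np.exp(-int_beta(t))

def sample_xt(x0, t, rng):
    # Draw x_t ~ q_t(x_t | x_0) = N(gamma_t * x0, sigma_t^2 * I).
    eps = rng.standard_normal(x0.shape)
    return gamma(t) * x0 + np.sqrt(sigma2(t)) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
xt, eps = sample_xt(x0, 0.5, rng)
```

Note that $\gamma_t^2 + \sigma_t^2 = 1$ for all $t$, which is exactly the variance-preserving property of this forward process.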
It can be shown that minimizing the Denoising Score Matching objective (2) is equivalent, up to a constant independent of $\theta$, to minimizing objective (1). After training, our score-based model will therefore have learned the score of the marginal distribution $q_t(x_t)$:
$$
s_{\theta}(x_t,t) \approx \nabla_{x_t} \log q_t(x_t)
$$
Based on the Denoising Score Matching objective (Equation 2), we can derive the following implementation:
Step 1. Noise Prediction:
Since we are using a normal distribution to model the diffusion process, we can rewrite $x_t$ as $x_t = \gamma_t x_0 + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(0,\mathbf{I})$. Now we can rewrite the Score function as:
$$
\nabla_{x_t} \log q_t(x_t|x_0) = - \nabla_{x_t} \frac{||x_t - \gamma_t x_0||_2^2}{2\sigma_t^2} = -\frac{x_t - \gamma_t x_0}{\sigma_t^2} = -\frac{\gamma_t x_0 + \sigma_t \epsilon - \gamma_t x_0}{\sigma_t^2} = -\frac{\epsilon}{\sigma_t}
$$
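The chain of equalities above can be checked numerically. The concrete values of $\gamma_t$ and $\sigma_t$ below are arbitrary placeholders:

```python
import numpy as np

# Verify that the closed-form conditional score -(x_t - gamma_t x_0) / sigma_t^2
# equals -eps / sigma_t when x_t = gamma_t x_0 + sigma_t eps.
rng = np.random.default_rng(1)
gamma_t, sigma_t = 0.8, 0.6        # arbitrary illustrative values
x0 = rng.standard_normal(3)
eps = rng.standard_normal(3)

xt = gamma_t * x0 + sigma_t * eps
score = -(xt - gamma_t * x0) / sigma_t ** 2
assert np.allclose(score, -eps / sigma_t)
```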
We can see that the score function depends only on the noise $\epsilon$ and the noise scale $\sigma_t$. We can therefore use a neural network $\epsilon_{\theta}(x_t,t)$ to predict the noise and define $s_{\theta}(x_t,t) := -\frac{\epsilon_{\theta}(x_t,t)}{\sigma_t}$. The objective function can now be rewritten as:
$$
\min_{\theta}\mathbb{E}_{t \sim U(0,T)}\mathbb{E}_{x_0 \sim p(x_0)}\mathbb{E}_{\epsilon \sim \mathcal{N}(0,\mathbf{I})}[\frac{1}{\sigma_t^2}||\epsilon - \epsilon_{\theta}(x_t,t)||_2^2]. \tag{3}
$$
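A minimal sketch of the per-sample loss in Equation (3). Here `eps_model` stands in for the network $\epsilon_{\theta}$; the "oracle" model that recovers $\epsilon$ exactly is a hypothetical stand-in used only to show the loss vanishes at the optimum:

```python
import numpy as np

def dsm_loss(eps_model, x0, t, gamma_t, sigma_t, eps):
    """Per-sample noise-prediction loss from Eq. (3):
    (1 / sigma_t^2) * ||eps - eps_theta(x_t, t)||_2^2."""
    xt = gamma_t * x0 + sigma_t * eps          # x_t = gamma_t x_0 + sigma_t eps
    return np.sum((eps - eps_model(xt, t)) ** 2) / sigma_t ** 2

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(3), rng.standard_normal(3)
gamma_t, sigma_t = 0.8, 0.6                    # arbitrary illustrative values

# A hypothetical oracle that recovers eps exactly drives the loss to zero.
oracle = lambda xt, t: (xt - gamma_t * x0) / sigma_t
loss_oracle = dsm_loss(oracle, x0, 0.5, gamma_t, sigma_t, eps)
loss_zero = dsm_loss(lambda xt, t: np.zeros_like(xt), x0, 0.5, gamma_t, sigma_t, eps)
```

In practice the expectations over $t$, $x_0$, and $\epsilon$ are approximated by averaging this per-sample loss over a minibatch.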
Note that we can also model $s_{\theta}(x_t,t)$ as defined before and change the objective to:
$$ \min_{\theta}\mathbb{E}_{t \sim U(0,T)}\mathbb{E}_{x_0 \sim p(x_0)}\mathbb{E}_{\epsilon \sim \mathcal{N}(0, \mathbf{I})}[\frac{1}{\sigma_t^2}||\sigma_t s_{\theta}(x_t,t) + \epsilon||_2^2]$$
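The two parameterizations give identical losses whenever $s_{\theta}(x_t,t) = -\epsilon_{\theta}(x_t,t)/\sigma_t$, which a quick numerical check (with arbitrary placeholder values) confirms:

```python
import numpy as np

# Check that ||sigma_t * s_theta + eps||^2 == ||eps - eps_theta||^2
# when s_theta = -eps_theta / sigma_t.
rng = np.random.default_rng(2)
sigma_t = 0.5                       # arbitrary illustrative value
eps = rng.standard_normal(4)
eps_pred = rng.standard_normal(4)   # stand-in for an arbitrary network output
s_pred = -eps_pred / sigma_t

lhs = np.sum((sigma_t * s_pred + eps) ** 2)
rhs = np.sum((eps - eps_pred) ** 2)
assert np.isclose(lhs, rhs)
```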
Step 2. Loss Weighting:
We can also add a weighting term to the objective function to balance the loss between the different time steps. This is done by adding a weighting term $\lambda(t)$ to the objective function:
$$
\min_{\theta}\mathbb{E}_{t \sim U(0,T)}\mathbb{E}_{x_0 \sim p(x_0)}\mathbb{E}_{\epsilon \sim \mathcal{N}(0,\mathbf{I})}[\frac{\lambda(t)}{\sigma_t^2}||\epsilon - \epsilon_{\theta}(x_t,t)||_2^2]. \tag{4}
$$
The weighting term can be defined as $\lambda(t) = \sigma_t^2$ to favour perceptual quality, or as $\lambda(t) = \beta(t)$, which yields the negative ELBO and the same objective as for denoising diffusion models when noise $\epsilon$ subtraction is used (discussed in the "Illustration of Diffusion Process").
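A small sketch of the weighted loss in Equation (4). With $\lambda(t) = \sigma_t^2$ the prefactor cancels and Equation (4) reduces to the unweighted "simple" loss $||\epsilon - \epsilon_{\theta}||_2^2$; the numeric values here are arbitrary:

```python
import numpy as np

def weighted_loss(eps, eps_pred, sigma_t, lam):
    # Per-sample weighted loss from Eq. (4): (lambda(t) / sigma_t^2) * ||eps - eps_theta||^2.
    return lam / sigma_t ** 2 * np.sum((eps - eps_pred) ** 2)

eps = np.array([1.0, -2.0])
eps_pred = np.array([0.5, -1.0])    # stand-in for a network prediction
sigma_t = 0.3

# With lambda(t) = sigma_t^2 the weighting cancels exactly.
lw = weighted_loss(eps, eps_pred, sigma_t, lam=sigma_t ** 2)
```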
You can check out "Elucidating the Design Space of Diffusion-Based Generative Models" (https://arxiv.org/abs/2206.00364) for more details on the different weighting schemes as well as training and model optimizations.
Step 3. Variance Reduction and Numerical Stability:
If we choose $\lambda(t) = \beta(t)$, the loss is heavily amplified when $t$ is sampled close to $0$, since $\sigma_t^2 \rightarrow 0$ as $t \rightarrow 0$. This can lead to numerical instabilities and high variance during training. To reduce the variance we can train with a small time cut-off $t_{cut}$ (for example $t_{cut} \approx 10^{-5}$), i.e. $t \sim U(t_{cut},T)$. This is a very simple way to reduce the variance of the training process. Alternatively, we can use importance sampling with a sampling distribution $p(t) \propto \frac{\lambda(t)}{\sigma_t^2}$ to reduce the variance.
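Both strategies can be sketched as follows. For simplicity this assumes a constant $\beta(t) = \beta$ (so $\sigma_t^2 = 1 - e^{-\beta t}$) and discretizes $t$ on a grid for the importance sampler; both choices are illustrative assumptions:

```python
import numpy as np

BETA, T, T_CUT = 1.0, 1.0, 1e-5     # assumed constant schedule and cutoff

def sigma2(t):
    # sigma_t^2 = 1 - exp(-beta * t) for constant beta(t) = BETA.
    return 1.0 - np.exp(-BETA * t)

rng = np.random.default_rng(3)

# 1) Truncated uniform sampling: t ~ U(t_cut, T) avoids the blow-up at t -> 0.
t_trunc = rng.uniform(T_CUT, T, size=10_000)

# 2) Importance sampling on a discrete grid: draw t with probability
#    p(t) proportional to lambda(t) / sigma_t^2 = BETA / sigma_t^2; the
#    per-sample loss is then divided by p(t) to keep the estimate unbiased.
grid = np.linspace(T_CUT, T, 10_000)
p = BETA / sigma2(grid)
p /= p.sum()
t_is = rng.choice(grid, size=10_000, p=p)
```

The importance distribution concentrates samples near $t = 0$, exactly where the weighted loss would otherwise explode, which is what reduces the variance of the gradient estimate.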