
/ic/ - Artwork/Critique


>> No.6255322
File: 133 KB, 1913x1230, noise.jpg

>>6255302
>>6255291
Look, you can see what I'm talking about in this very article you sent me.

>However, we don't know $p(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$. It's intractable since it requires knowing the distribution of all possible images in order to calculate this conditional probability. Hence, we're going to leverage a neural network to approximate (learn) this conditional probability distribution, let's call it $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, with $\theta$ being the parameters of the neural network, updated by gradient descent.
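To make that concrete, here's a rough sketch (mine, not the article's) of what learning $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ means: the network produces the mean of a Gaussian, you sample the previous step from it, and $\theta$ gets trained by gradient descent. All names here are illustrative.

```python
# Hedged sketch: model p_theta(x_{t-1} | x_t) as a Gaussian whose mean is
# produced by a neural network; the variance sigma_t is kept fixed, as in DDPM.
import torch

def reverse_step(mean_net, x_t, t, sigma_t):
    mu = mean_net(x_t, t)  # mu_theta(x_t, t), learned by gradient descent
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mu + sigma_t * z  # a sample of x_{t-1}
```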

>reparametrize the mean to make the neural network learn (predict) the added noise (via a network $\mathbf{\epsilon}_\theta(\mathbf{x}_t, t)$) for noise level $t$ in the KL terms which constitute the losses. This means that our neural network becomes a noise predictor, rather than a (direct) mean predictor.
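In practice that reparametrization boils down to the simplified DDPM objective $\|\mathbf{\epsilon} - \mathbf{\epsilon}_\theta(\mathbf{x}_t, t)\|^2$. A minimal training-loss sketch, assuming a noise-prediction network `eps_net` and the usual cumulative product of alphas (my naming, not the article's):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_net, x0, t, alphas_cumprod):
    eps = torch.randn_like(x0)                   # the noise actually added
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # \bar{alpha}_t per batch element
    # closed-form forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return F.mse_loss(eps_net(x_t, t), eps)      # ||eps - eps_theta(x_t, t)||^2
```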

>The neural network needs to take in a noised image at a particular time step and return the predicted noise.
>What is typically used here is very similar to an Autoencoder. The encoder first encodes an image into a smaller hidden representation called the "bottleneck", and the decoder then decodes that hidden representation back into an actual image. This forces the network to only keep the most important information in the bottleneck layer.
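So the interface is literally (noised image, timestep) in, predicted noise out, with an encoder/bottleneck/decoder in between. A toy illustration of just that plumbing (nothing like the real model's size):

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, ch=64, max_t=1000):
        super().__init__()
        self.t_embed = nn.Embedding(max_t, ch)                   # one vector per noise level t
        self.encoder = nn.Conv2d(3, ch, 3, stride=2, padding=1)  # compress to the "bottleneck"
        self.decoder = nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1)  # decode back to image shape

    def forward(self, x_t, t):
        h = torch.relu(self.encoder(x_t))
        h = h + self.t_embed(t)[:, :, None, None]  # condition on the timestep
        return self.decoder(h)                     # predicted noise, same shape as x_t
```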

>In terms of architecture, the DDPM authors went for a U-Net. This network, like any autoencoder, consists of a bottleneck in the middle that makes sure the network learns only the most important information. Importantly, it introduced residual connections between the encoder and decoder, greatly improving gradient flow.
>As can be seen, a U-Net model first downsamples the input (i.e. makes the input smaller in terms of spatial resolution), after which upsampling is performed.
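And the U-Net part is just that autoencoder shape plus skip connections carrying encoder features straight to the decoder. Bare-bones sketch of the idea (illustrative only; the real DDPM U-Net has ResNet blocks, attention, etc.):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.Conv2d(3, ch, 3, stride=2, padding=1)            # downsample (halve resolution)
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)                      # bottleneck
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)    # upsample back
        self.out = nn.Conv2d(ch + 3, 3, 3, padding=1)                   # fuse decoder + skip features

    def forward(self, x):
        skip = x                                      # encoder features saved for the skip path
        h = torch.relu(self.down(x))
        h = torch.relu(self.mid(h))
        h = torch.relu(self.up(h))
        return self.out(torch.cat([h, skip], dim=1))  # skip connection into the decoder
```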
