Training ControlNet in pixel space rather than latent space struggles to produce good results

I am having a problem training ControlNet in pixel space.
The model doesn't generate images that follow the input condition, even after more than 50k iterations.
Maybe the reason is that learning directly in pixel space is harder than learning in latent space.
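
To give a rough sense of the gap I am wondering about, here is a small sketch comparing how many values the model has to handle per image in each space. The numbers are illustrative assumptions only (512x512 RGB images, and a Stable Diffusion-style VAE with 8x spatial downsampling and 4 latent channels), not necessarily my exact configuration.

```python
# Illustrative comparison of per-image dimensionality in pixel vs. latent space.
# Assumed values (hypothetical, for illustration): 512x512 RGB input,
# a VAE that downsamples by 8x spatially and outputs 4 latent channels.

def num_elements(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a (channels, height, width) tensor."""
    return height * width * channels

IMAGE_SIZE = 512        # assumed image resolution
VAE_DOWNSAMPLE = 8      # assumed spatial downsampling factor
LATENT_CHANNELS = 4     # assumed number of latent channels

pixel_dims = num_elements(IMAGE_SIZE, IMAGE_SIZE, 3)
latent_dims = num_elements(
    IMAGE_SIZE // VAE_DOWNSAMPLE, IMAGE_SIZE // VAE_DOWNSAMPLE, LATENT_CHANNELS
)
ratio = pixel_dims / latent_dims

print(f"pixel space:  {pixel_dims} values per image")
print(f"latent space: {latent_dims} values per image")
print(f"ratio:        {ratio:.0f}x")
```

Under these assumptions the pixel-space model works with about 48x more values per image, which might be part of why it needs more iterations or capacity, though I have not verified that this is the actual cause.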
Thanks!