Hi, I've been playing with this diffusion model library for a few days, it is great to have such library that allows common users to train audio data with limited resources.
I have a problem with regard to the training data and the output. I fed the unconditional model with mozilla's common voice dataset. I used only one language and the size is about 15k. I resampled them to 44.1k and padded them to 2^18 samples per file if shorter. And the unconditional results were okay, at least I could tell it's human speaking although never actually audible.
But when I replace the training data with music (mostly pure pianos, same sample rate but 2^17 samples per input tensor), the model is not generating outputs that sounds like piano, in fact they are mostly noise.
I used the same configurations for each layers for both models, tried lowering the downsampling factors or increase attentions heads, but no significant difference. Any tips on why my problem happens?
Hi, I've been playing with this diffusion model library for a few days, it is great to have such library that allows common users to train audio data with limited resources.
I have a problem with regard to the training data and the output. I fed the unconditional model with mozilla's common voice dataset. I used only one language and the size is about 15k. I resampled them to 44.1k and padded them to 2^18 samples per file if shorter. And the unconditional results were okay, at least I could tell it's human speaking although never actually audible.
But when I replace the training data with music (mostly pure pianos, same sample rate but 2^17 samples per input tensor), the model is not generating outputs that sounds like piano, in fact they are mostly noise.
I used the same configurations for each layers for both models, tried lowering the downsampling factors or increase attentions heads, but no significant difference. Any tips on why my problem happens?