Zero-Shot Mono-to-Binaural Speech Synthesis
We present ZeroBAS, a neural method for synthesizing binaural speech from monaural speech recordings and positional information, without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural speech synthesis. Specifically, we show that parameter-free geometric time warping and amplitude scaling based on source location suffice to produce an initial binaural synthesis, which can then be refined by iteratively applying a pretrained denoising vocoder. We further find that this approach generalizes across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, for evaluating state-of-the-art mono-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the previous standard mono-to-binaural benchmark dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset.
Model

Binaural Speech Dataset
| Mono | BinauralZero | WarpNet | BinauralGrad | NFS | Ground Truth |
|---|---|---|---|---|---|
TUT Mono-to-Binaural Dataset
| Mono | BinauralZero | WarpNet | BinauralGrad | NFS | Ground Truth |
|---|---|---|---|---|---|