🖼️ Image → 🎙️ Speech (CPU)

1) Caption with BLIP-2 → 2) Speak with SpeechT5 (HiFiGAN vocoder).
The first run downloads the models and speaker embeddings, so please wait.
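
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code. The checkpoint names (`Salesforce/blip2-opt-2.7b`, `microsoft/speecht5_tts`, `microsoft/speecht5_hifigan`), the `Matthijs/cmu-arctic-xvectors` speaker-embedding dataset with index 7306, and the helper names `load_captioner`, `load_speaker`, and `image_to_speech` are all assumptions chosen for the example. Heavy imports are deferred into the loader functions so that nothing is downloaded until a loader is called.

```python
from typing import Any, Callable, Tuple

SAMPLE_RATE = 16_000  # SpeechT5 generates 16 kHz audio


def load_captioner(model_id: str = "Salesforce/blip2-opt-2.7b") -> Callable[[Any], str]:
    """Return a function that maps a PIL image to a caption (BLIP-2 on CPU).

    The checkpoint name is an assumption; the first call downloads the weights.
    """
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    def caption(image: Any) -> str:
        inputs = processor(images=image, return_tensors="pt")
        ids = model.generate(**inputs, max_new_tokens=30)
        return processor.batch_decode(ids, skip_special_tokens=True)[0].strip()

    return caption


def load_speaker(speaker_index: int = 7306) -> Callable[[str], Any]:
    """Return a function that maps text to a waveform (SpeechT5 + HiFiGAN).

    The x-vector dataset and speaker index are assumptions; some versions of
    `datasets` may need extra options to load this dataset.
    """
    import torch
    from datasets import load_dataset
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker = torch.tensor(xvectors[speaker_index]["xvector"]).unsqueeze(0)

    def speak(text: str) -> Any:
        inputs = processor(text=text, return_tensors="pt")
        with torch.no_grad():
            wave = model.generate_speech(inputs["input_ids"], speaker, vocoder=vocoder)
        return wave.numpy()  # 1-D float array sampled at SAMPLE_RATE

    return speak


def image_to_speech(
    image: Any,
    caption_fn: Callable[[Any], str],
    speak_fn: Callable[[str], Any],
) -> Tuple[str, Any]:
    """Step 1: caption the image. Step 2: synthesize speech for the caption."""
    text = caption_fn(image)
    return text, speak_fn(text)
```

To try it with the real models, one would call `load_captioner()` and `load_speaker()` once, then pass the returned functions to `image_to_speech` together with a PIL image. Loading once and reusing the functions avoids paying the download and initialization cost on every request, which matters on CPU.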