Stanisław (Stan) Findeisen

Software Engineer/Architect

I transform complex requirements into elegant software architectures.

Playing with Voice AI

I'm interested in speech recognition (speech-to-text) and speech transformation (speech-to-speech), particularly in the context of online and offline translation between natural languages. In this short experiment, I translated a single Polish audio sentence into several languages, including English, German, and French. Here are the results:

Polish (the original)
English
German
French

As you can hear, all the versions retain the original speaker's voice. The English translation sounds a bit awkward because of its strong Russian-style accent. Or perhaps the cause is that each version keeps the same audio length (23 seconds) and similar phrase timing across languages. The German translation, on the other hand, sounds almost natural, at least to my ears, though I'm not a native German speaker. As for the French, I have no clue, so I'll leave that judgment up to you!

I ran this experiment in ElevenLabs' Dubbing Studio. Since their free plan doesn't directly support audio-to-audio translation, I first had to wrap the input mp3 file in an mp4 with a plain black video track:

ffmpeg -loop 1 -f lavfi -i color=c=black:s=1280x720:d=0.1 -i KPalysOP-PL-orig.mp3 -c:a copy -c:v libx264 -tune stillimage -shortest KPalysOP-PL-orig.mp4

Then I converted each dubbed mp4 back to mp3, for example the German version:

ffmpeg -i KPalysOP-DE.mp4 -q:a 0 -map a KPalysOP-DE.mp3
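The two conversion steps above could be wrapped in a pair of small shell functions, which is handy when repeating the round trip for several languages. This is only a sketch, not the exact procedure used in the experiment: the function names (`run`, `audio_to_video`, `video_to_audio`) and the `DRY_RUN` switch are hypothetical, and the mp3-to-mp4 step is a variant that leaves out `-loop 1` and the `d=0.1` duration, on the assumption that an unbounded black `color` source cut off by `-shortest` is enough to produce a video as long as the audio.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the two ffmpeg steps (a sketch, not the
# author's exact commands). Set DRY_RUN=1 to print the ffmpeg command
# lines instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# mp3 -> mp4: generate a black 1280x720 video track and copy the audio
# stream unchanged. Assumption: the color source has no fixed duration,
# so -shortest ends the output when the audio ends.
audio_to_video() {
    run ffmpeg -f lavfi -i color=c=black:s=1280x720 -i "$1" \
        -c:a copy -c:v libx264 -tune stillimage -shortest "$2"
}

# mp4 -> mp3: keep only the audio stream and re-encode it at the
# highest VBR quality setting (-q:a 0).
video_to_audio() {
    run ffmpeg -i "$1" -map a -q:a 0 "$2"
}
```

After sourcing the file, `audio_to_video KPalysOP-PL-orig.mp3 KPalysOP-PL-orig.mp4` would prepare the upload, and `video_to_audio KPalysOP-DE.mp4 KPalysOP-DE.mp3` would extract the dubbed audio. Running either with `DRY_RUN=1` first shows the command line without touching any files.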

The original audio input was taken from here.