D-ID launches an AI video translation tool that includes voice cloning and lip sync

AI video creation platform D-ID is the latest company to ship a tool for translating videos into other languages using AI technologies. However, in this case, D-ID also clones the speaker’s voice and changes their lip movements to match the translated words as part of the AI editing process.

The technology stems from D-ID’s earlier work, which you may recall from the viral trend a few years ago in which users animated their old family photos and later made those photos speak. On the back of that success, the startup closed a $25 million Series B round in 2022 with an eye on serving its growing number of enterprise customers in the U.S. who were using its technology to make AI-powered videos.

With the company’s newly launched AI Video Translate tech, currently offered to D-ID subscribers for free, creators can automatically translate their videos into other languages to expand their reach. Thirty languages are currently available, including Arabic, Mandarin, Japanese, Hindi, Spanish and French. A D-ID subscription starts at $56 per year for the cheapest plan, which includes the smallest number of credits to use toward AI features, and goes up to $1,293 per year before shifting to enterprise pricing.

D-ID suggests the new AI video technology could help customers save on localization costs when scaling their campaigns to a global audience in areas like marketing, entertainment, and social media. The technology will compete with other solutions for both dubbing and AI video.

For years, dubbing technologies have made it easier for viewers to listen to audio in their own language, but they were often inaccessible to smaller creators. That has been changing as companies improve access to the technology. For example, YouTube released a multi-language audio feature designed to help its creators connect with a wider audience by translating their videos into other languages. Well-known creator MrBeast (Jimmy Donaldson) was among the early adopters, using the tech to bring several of his popular videos to 11 more languages.

With AI, the ability to create, translate, or clone voices is also expanding. Microsoft this year announced it would use AI to translate and dub videos from YouTube and other sites as you watch. In July, creator platform Vimeo unveiled tools that translate audio and captions while replicating the speaker’s voice with AI. Numerous companies also offer voice cloning or AI translation tools (or sometimes both), including Descript, ElevenLabs, Speechify, Veed, Camb.ai, Captions.ai and Akool, to name a few, as well as tools that let you create videos using AI avatars that can speak dozens of languages, like those from HeyGen, Deepbrain AI and others.

Dubbing and lip sync AI libraries, like Wav2Lip, have also made it easier for startups to build these sorts of tools, which they pitch to creators as a simpler and perhaps more affordable way to use AI technology. (AI Video Translate, however, is powered by D-ID’s newly developed proprietary model, Rosetta-1.)
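
For a sense of what the open source side of that looks like, here is a minimal sketch of lip-syncing a dubbed audio track onto a source video using Wav2Lip’s published inference script. The file names and checkpoint path are placeholders, and this is entirely separate from D-ID’s proprietary Rosetta-1 pipeline.

```python
# Minimal sketch: re-syncing a speaker's lips to a dubbed audio track with the
# open source Wav2Lip project (https://github.com/Rudrabha/Wav2Lip).
# The script name and flags follow the repo's README; the video, audio and
# checkpoint paths below are placeholders.
import subprocess

def lip_sync(face_video: str, dubbed_audio: str, out_path: str) -> None:
    """Run Wav2Lip's inference script to re-time lip movements to new audio."""
    subprocess.run(
        [
            "python", "inference.py",                # entry point shipped with the Wav2Lip repo
            "--checkpoint_path", "wav2lip_gan.pth",  # pretrained weights, downloaded separately
            "--face", face_video,                    # original video of the speaker
            "--audio", dubbed_audio,                 # translated / cloned voice track
            "--outfile", out_path,                   # where the lip-synced result is written
        ],
        check=True,
    )

if __name__ == "__main__":
    lip_sync("speaker.mp4", "speaker_es.wav", "speaker_es_lipsync.mp4")
```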

D-ID says its new video translation technology will be available through D-ID Studio and its API. A one-month trial is being offered, and further demos are available on its website.

The company says videos can be between 10 seconds and 5 minutes in length, and the file size should be under 2GB. The feature works with only one person in the frame and, for best results, that person should be facing the camera with their face visible at all times.
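
As a rough illustration of those limits, here is a hypothetical pre-flight check a creator might run before uploading a clip. It is not part of D-ID’s tooling, and it assumes ffprobe is installed locally to read the video’s duration.

```python
# Hypothetical pre-flight check mirroring the stated limits (10 seconds to
# 5 minutes, under 2GB); not part of D-ID's tooling. Duration is read with
# ffprobe, which must be installed and on the PATH.
import os
import subprocess

MIN_SECONDS = 10
MAX_SECONDS = 5 * 60
MAX_BYTES = 2 * 1024 ** 3  # 2 GB

def video_duration_seconds(path: str) -> float:
    """Return the clip duration using ffprobe's container metadata."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def check_upload_limits(path: str) -> list[str]:
    """Collect any violations of the published length and size limits."""
    problems = []
    duration = video_duration_seconds(path)
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f}s is outside the 10s to 5min range")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 2GB")
    return problems

if __name__ == "__main__":
    print(check_upload_limits("speaker.mp4") or "ok")
```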

Timekettle’s $699 translation hardware handles multiple languages at once

Personal translation devices have had a hugely transformative decade. Improvements to processing power, machine learning and cloud platforms have all played key roles in this development. The technology is increasingly becoming a mainstay of wireless earbuds, and the recent explosion of generative AI platforms will only serve to further these impressive results.

It’s easy to imagine a time in the not-so-distant future when real-time, in-person smartphone translation is a ubiquitous commodity. What, precisely, such a sea change would do to those companies building standalone devices remains to be seen, of course, but in the meantime, we’re seeing a truly world-changing technology grow increasingly accessible.

Timekettle first crossed our radar back in 2017, when TechCrunch was hosting an event in the startup’s home of Shenzhen. At the time, the young company was showing off a face-to-face communication device that looked like a pair of oversized earbuds. You take one, give the other to someone else and start talking.

Announced today at CES 2024, the X1 Interpreter Hub is a more robust solution, designed for meetings. Timekettle calls it “the world’s first multi-language simultaneous interpretation system,” a lofty claim, to be sure, but certainly a compelling one.

The system works out of the box, without having to download a separate app. For in-person meetings, two devices are touched together to initiate conversation translation. The handheld devices house earbuds, similar to past Timekettle products. All told, the X1 is capable of supporting up to 20 people at once in five languages.

The system can also handle virtual conversations. Remote users dial into the phone number associated with the product to access its translation capabilities.

The Timekettle X1 is available online starting today, priced at $699.

Read more about CES 2024 on TechCrunch
