ChatGPT can now hear, speak, see, and understand multimodal prompts


ChatGPT will be getting new features in the coming weeks to make interaction more intuitive. Some of these are previously announced speech and image recognition features.

OpenAI is extending ChatGPT’s speech capabilities first in its apps for iOS and Android. You could already use voice chat instead of typing based on OpenAI’s open-source Whisper model, but it was one-way. Now ChatGPT can also respond by voice. Although Whisper works in languages other than English, OpenAI discourages its use in other languages – especially those without Latin characters.

OpenAI has developed its own text-to-speech model

For voice output, OpenAI has developed its own text-to-speech model, which is also being adapted by Spotify. This makes OpenAI a solution provider in this area as well, competing with startups like Elevenlabs that focus on synthetic voices.

OpenAI’s AI voice is capable of generating human-sounding synthetic voices in the style of the original voice from text with just a few seconds of sample audio. For the five ChatGPT voices, OpenAI worked with professional voice actors.



Watch the video: OpenAI

In addition to ChatGPT, the text-to-speech model is also used by Spotify, as mentioned above. The Swedish music streaming service uses it to translate podcasts into other languages in the voice of the podcast host. Spotify has released the first examples in Spanish. French and German will follow in the coming days and weeks.

  1. Lex Fridman Podcast – “Interview with Yuval Noah Harari”.
  2. Armchair Expert – “Kristen Bell, by the grace of god, returns”
  3. The Diary of a CEO with Steven Bartlett – “Interview with Dr. Mindy Pelz”

Video: Spotify

OpenAI has not yet announced whether other companies or individuals will have access to the new text-to-speech model. However, the announcement notes that due to the risk of abuse by voice clones, a controlled rollout will initially take place for selected usage scenarios – such as voice chat and Spotify podcasts.

Voice can be combined with another ChatGPT innovation, the ability to recognize and talk about content in images. This feature was announced at the launch of GPT-4 and is now being rolled out.


system card for GPT-4V(ision), OpenAI describes the model’s image recognition as “unreliable”. For example, when analyzing chemical structures, it misidentified substances such as fentanyl, carfentanil, and cocaine, but sometimes correctly identified dangerous components such as poisonous mushrooms.

The model was prone to hallucinations and could present incorrect facts in an authoritative tone. “This demonstrates that the model is unreliable and should not be used for any high-risk tasks such as identification of dangerous compounds or foods,” the paper says.

Here, OpenAI specifically warns against using GPT-4V in a scientific and medical context. The company also provides examples of the model refusing to answer to avoid providing feedback on possible biases.

Image: OpenAI

OpenAI limits ChatGPT’s image analysis for humans

Much of OpenAI’s announcement of the new ChatGPT features revolves around the promise of developing safe and useful AI. The time since the unveiling of GPT-4 and its associated image recognition has been used for intensive testing, the company says. Nevertheless, hallucinations could not be ruled out.

“We’ve also taken technical measures to significantly limit ChatGPT’s ability to analyze and make direct statements about people since ChatGPT is not always accurate and these systems should respect individuals’ privacy,” OpenAI writes. However, real-world use helps improve those protections, it adds.

Previously, there were reports that OpenAI was concerned that ChatGPT’s image understanding could be misused as a facial recognition system and should be restricted. The “Be My Eyes” app, which describes the environment for the visually impaired, disabled facial recognition months ago. In the system card for GPT-4V(ision), OpenAI writes that it is working on a feature that can describe faces without identifying people.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top