Industry News

OpenAI's new multimodal AI model GPT-4o rolls out to all ChatGPT users, faster and at half the price

林妍溱 | 2024-05-17 14:29:37 | iThome

OpenAI has announced GPT-4o, a new generation of multimodal AI model that will be rolled out gradually to all ChatGPT users, emphasizing that GPT-4o can respond to audio input as quickly as a human does in conversation.

After stoking media attention last week, OpenAI announced on May 13 that its latest multimodal AI model, GPT-4o, will be made available across ChatGPT, including the free tier. For developers, the new model is 2x faster than GPT-4 Turbo, has 5x higher rate limits, and costs half as much.

OpenAI CEO Sam Altman noted that GPT-4o is smarter, faster, and natively multimodal. Text and image input is now being rolled out gradually to ChatGPT, including the free version, but image generation is not yet supported, and voice input and output are not yet available.


The announcement puts an end to last week's media speculation. Bloomberg, The Information, and Reuters had successively reported that OpenAI would launch a search service to challenge Google and Perplexity AI. Over the weekend, however, Altman said the announcement would be neither GPT-5 nor a search engine, but "magic-like" updates to ChatGPT and GPT-4.


The "o" in GPT-4o stands for omni: the model accepts prompts composed of any combination of text, audio, and images, and its output can likewise combine text, audio, and images. OpenAI emphasizes GPT-4o's performance: it responds to audio input in an average of 320 milliseconds, comparable to human reaction time in conversation, and as quickly as 232 milliseconds.

As vendors typically do when announcing new models, OpenAI also published data showing that GPT-4o's vision and audio understanding surpasses that of its predecessors and competitors. Its text, reasoning, and coding performance matches GPT-4 Turbo, while its multilingual capabilities (particularly in non-English languages), speech translation, and visual understanding exceed those of GPT-4, GPT-4 Turbo, Claude 3 Opus, Gemini Pro 1.5, and Meta Llama 3 400B.


OpenAI explains why the new model's voice mode performs better. Previously, voice mode under GPT-3.5 and GPT-4 chained three models in sequence: a first model transcribed audio into text, GPT-3.5 or GPT-4 generated the text of the reply, and a third model converted that text back into audio. This not only added latency, but also meant that GPT-3.5 or GPT-4 lost a great deal of information along the way: it could not perceive tone, multiple speakers, or background noise, and it could not laugh, sing, or express emotion in its output. GPT-4o, by contrast, is a single model that understands text, vision, and audio, with all input and output processed in the same neural network, which greatly improves both the speed of interaction and the richness of its responses.
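For context, the old cascaded voice pipeline described above can be sketched with OpenAI's public Python SDK. This is a minimal illustration, not OpenAI's internal implementation: the specific model names (whisper-1, gpt-4, tts-1), file names, and voice choice are assumptions for the example.

```python
# Illustrative sketch of the legacy three-model voice pipeline (assumed setup).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: a speech-to-text model transcribes the user's audio into plain text.
with open("user_question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: a text-only model generates the reply; tone, background noise and
# speaker identity from the original audio are already lost at this point.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# Step 3: a separate text-to-speech model converts the reply back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.write_to_file("assistant_reply.mp3")
```

Each hop in this chain adds latency and discards information, which is the bottleneck GPT-4o's single end-to-end model is meant to remove.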


OpenAI also released several videos demonstrating the new model's capabilities: two GPT-4o-based chatbots, one playing a mobile carrier's call-center agent talking with a customer; one chatbot asking questions while the other describes the OpenAI employee it "sees" through the camera, with the first then improvising a song based on that description; and a chatbot holding a fluent conversation with OpenAI employees, laughing along the way and stopping automatically when a human interrupts.

The chatbot's voice in the videos is natural and lively, and media outlets described it as strikingly similar to that of Scarlett Johansson, who voiced the AI assistant in Altman's favorite movie "Her".


After showcasing GPT-4o's capabilities, OpenAI also emphasized its safety. According to its Preparedness Framework and human evaluations, the new model's risk in cybersecurity, CBRN (chemical, biological, radiological and nuclear) threats, persuasion, and model autonomy is no higher than Medium. The company also stressed that GPT-4o underwent external red teaming by more than 70 outside experts across fields such as social psychology, bias, and misinformation to help identify and reduce those risks.


Starting today, OpenAI will gradually roll out GPT-4o's text and image input and text output to all ChatGPT users, including the free version, with paid Plus users getting a message limit up to 5x higher. To address the voice deepfake risks of ChatGPT's voice mode, audio output is limited to a set of preset voices, and OpenAI says existing safety policies will be followed. An alpha version of the GPT-4o-based voice mode will be rolled out to ChatGPT Plus users in the coming weeks.


For developers, API access to GPT-4o as a text and vision model is available now. Compared with GPT-4 Turbo, the new model is 2x faster, has 5x higher rate limits, and costs half as much. OpenAI plans to offer its audio and video capabilities to a small group of users and trusted partners in the coming weeks.
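A minimal sketch of calling the GPT-4o text-and-vision API through OpenAI's Python SDK is shown below; the prompt text and image URL are placeholder assumptions for illustration.

```python
# Minimal GPT-4o text + image request via the Chat Completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```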
