GPT-4o Explained: Everything You Need to Know

By Zikrul
Twice as fast and half the price – what does GPT-4o mean for AI chatbots? Following their mysterious announcement, OpenAI has launched the latest version of their flagship model: GPT-4o.

The latest model doesn’t just get a dramatic boost in multimodal capabilities. It’s also faster and cheaper than GPT-4 Turbo. While mainstream media coverage has been captivated by the new flagship model’s video and voice capabilities in ChatGPT, the new pricing and speed are equally impactful for those using GPT to power their own applications.

What is GPT-4o?



In May 2024, the artificial intelligence (AI) research company OpenAI launched its latest flagship language model: GPT-4o.

GPT was already known as a technology that makes human work easier across a range of applications, from natural language processing to chatbots and virtual assistants. With GPT-4o, that convenience is expected to become even more sophisticated and effective.

According to OpenAI, GPT-4o offers intelligence on par with GPT-4 while being faster, more efficient, and more broadly accessible.

The "o" in GPT-4o stands for "omni," meaning "all." GPT-4o is designed to accept any type of input and produce any type of output across formats such as text, audio, images, and video, or combinations of these, tailored to user needs.


What can GPT-4o do?


The new flagship model comes with a list of exciting updates and new features: enhanced voice and video capabilities, real-time translation, and more natural language interaction. It can analyze images, understand a wider range of audio input, help with summarization, and create charts. Users can upload files and hold voice-to-voice conversations. It even comes with a desktop app.

In a series of launch videos, OpenAI employees (and partners like Sal Khan of Khan Academy) demonstrated the latest version of GPT preparing users for job interviews, singing, identifying human emotions from facial expressions, solving written math equations, and even interacting with another instance of GPT-4o.

The launch illustrates a new reality where an AI model can analyze your child’s notebook and respond. It can explain the concept of adding fractions for the first time, changing tone and tactics based on your child’s understanding — it can cross the line from chatbot to personal tutor.

How does GPT-4o work?


Details on how the latest model works are still limited. However, OpenAI has revealed that GPT-4o works much like other GPT models; the difference is that its neural network was trained end-to-end across modalities, so a single network can process every type of input and output.

As a Generative Pre-trained Transformer (GPT), the model was trained on vast datasets, including billions of images and tens of thousands of hours of audio. As a result, it not only recognizes words but can also understand what a given topic looks and sounds like.

In addition, GPT-4o uses the transformer architecture found in almost all modern AI models, which allows it to attend to the important parts of long, complex requests and remember information from earlier in a conversation.

What Does GPT-4o Mean for LLM Chatbots?


AI chatbots running on LLMs benefit every time a company like OpenAI updates its models. If LLM agents are connected to a bot-building platform like Botpress, they receive all the benefits of the latest GPT model in their own chatbots.

With the release of GPT-4o, AI chatbot builders can now choose a higher-end model, changing their bots' capabilities, price, and speed. The new model has 5x higher rate limits than GPT-4 Turbo, with the ability to process up to 10 million tokens per minute.

For bots using audio integrations like Twilio in Botpress, a new world of voice-powered interactions has opened up. No longer limited by the audio processing of the past, chatbots are one step closer to mimicking human interaction.

Perhaps most importantly, the cost is lower for paying users. Running a chatbot with the same capabilities at half the cost can drastically increase access and affordability worldwide. And Botpress users pay no additional cost for the AI overhead of their bots, so these savings go directly to creators.

And from the user’s perspective, GPT-4o means a much better user experience. Nobody likes to wait. Shorter response times mean higher user satisfaction for AI chatbot users.

1. Users Love Speed


One of the primary benefits of chatbot adoption is improving the user experience. And what could be more of a user experience improvement than reducing wait times?

“It’s definitely going to be a better experience,” Hamelin says. “The last thing you want to do is wait for someone.”

Humans hate waiting. Back in 2003, a study found that people were only willing to wait about 2 seconds for a web page to load. Our patience certainly hasn’t improved since then.

2. Everyone hates waiting


There are tons of UX tips out there for reducing perceived wait times. Often we can’t increase the speed of events, so we focus on how to make users feel like time is passing more quickly. Visual feedback, like images of loading bars, is there to shorten perceived wait times.

In a famous story about elevator wait times, an old building in New York was getting a lot of complaints. Residents had to wait 1-2 minutes for the elevator to arrive. The building could not afford to upgrade to a newer model and residents were threatening to break their lease.

A new employee, trained in psychology, discovered that the real problem was not the two minutes wasted, but boredom. He suggested installing mirrors so residents could see themselves or others while they waited. The complaints about the elevators stopped, and now, it is common to see mirrors in elevator lobbies.

Rather than taking shortcuts to improve the user experience - such as visual feedback - OpenAI has improved the experience at its source. Speed is at the heart of the user experience, and no amount of trickery can match the satisfaction of efficient interaction.

3. Savings for Everyone


Using this new AI model to power applications just got cheaper. Much cheaper. Running AI chatbots at scale can be expensive. The LLM powering your bot determines how much you'll pay for each user interaction at scale (at least at Botpress, where AI spend is matched 1:1 with LLM costs).

And these savings aren't just for developers using the API. GPT-4o is now available on ChatGPT's free tier alongside GPT-3.5, so free users can use the ChatGPT applications at no cost.

4. Better tokenization


If you're interacting with models in languages that don't use the Roman alphabet, GPT-4o will reduce your API costs.


The new model comes with improved usage thresholds and a significant jump in tokenization efficiency, concentrated mostly in certain non-English languages. The new tokenizer requires fewer tokens to process the same input text, and it is far more efficient for logographic languages (languages that use characters and symbols rather than alphabet letters).

The benefits are mostly concentrated in languages that do not use the Roman alphabet. The savings are estimated as follows (a quick way to compare token counts yourself is sketched after this list):
  • Indic languages, such as Hindi, Tamil, or Gujarati, have a token reduction of 2.9 - 4.4x
  • Arabic has a token reduction of ~2x
  • East Asian languages, such as Mandarin, Japanese, and Vietnamese have a token reduction of 1.4 - 1.7x
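
As a rough, hedged illustration, OpenAI's open-source tiktoken library lets you compare the older cl100k_base encoding (used by GPT-4 and GPT-4 Turbo) with the newer o200k_base encoding (used by GPT-4o). The Hindi sample sentence below is just an arbitrary example:

  # Compare token counts for the same Hindi sentence under the GPT-4 Turbo
  # tokenizer (cl100k_base) and the GPT-4o tokenizer (o200k_base).
  # Requires: pip install tiktoken (version 0.7.0 or newer for o200k_base).
  import tiktoken

  text = "नमस्ते, आप कैसे हैं?"  # "Hello, how are you?" in Hindi

  old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
  new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

  old_count = len(old_enc.encode(text))
  new_count = len(new_enc.encode(text))

  print(f"cl100k_base: {old_count} tokens")
  print(f"o200k_base:  {new_count} tokens")
  print(f"reduction:   {old_count / new_count:.1f}x")

Fewer tokens for the same text means lower per-request API costs and more room in the context window for speakers of these languages.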

5. Closing the AI digital divide


The digital age has brought about an extension of a long-standing and well-documented wealth gap - the digital divide. Just as access to wealth and robust infrastructure is limited to a select population, so is access to AI and the opportunities and benefits that come with it.

Robert Opp, Chief Digital Officer at the United Nations Development Programme (UNDP), explains that the presence of AI platforms has the ability to make or break a country’s development metrics:

“One big concern that we have, is that countries that are more prepared and skilled in AI platforms, both in terms of development and use, they could have a much faster development process and countries that don’t have the skills and capacity will be left behind.”

By halving API costs with GPT-4o and introducing a free tier, OpenAI is taking a significant step toward counteracting one of the biggest problems with AI - and directly addressing an inequality that is on the minds of policymakers and economists.

Positive PR for big AI is more necessary than enthusiasts might think. As AI becomes more ubiquitous in our daily lives, advocates and skeptics alike are asking how we can use AI ‘for good.’



According to AI PhD and educator Louis Bouchard, distributing wider access to AI is how we do just that: “Making AI accessible is one, if not the best, way to use AI ‘for good.’” The reason? If we can’t fully control the positive and negative impacts of AI technology—at least in the early days—we can ensure equal access to its potential benefits.

6. Expanded Multimodal Potential


The most common way to interact with business chatbots is through text, but the enhanced multimodal capabilities of OpenAI’s new AI models suggest that this may change in the future.

In the coming year, we’ll likely see a wave of developers launching new applications that take advantage of newly accessible audio, vision, and video capabilities. For example, a GPT-powered chatbot could have the ability to:
  • Ask customers for pictures of items they've returned, to identify the product and ensure it's not damaged (a rough API sketch of this follows the list)
  • Provide real-time audio translation in conversations that take into account regional dialects
  • Tell if your steak is done from a picture of it in a pan
  • Serve as a free personal tour guide, providing historical context based on pictures of old cathedrals, providing real-time translations, and delivering a customized voice tour that allows for back-and-forth communication and questions
  • Power a language learning app that listens to audio input, can provide feedback on pronunciation based on a video of your mouth movements, or teach sign language through images and videos
  • Provide non-urgent mental health support by combining its ability to interpret audio and video, enabling low-cost speech therapy.
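
To give a concrete sense of the first example above, here is a minimal, hedged sketch of an image-understanding request using OpenAI's official Python SDK; the image URL and prompt are placeholders, not part of any real integration:

  # Minimal sketch: asking GPT-4o to inspect a photo of a returned item.
  # Assumes `pip install openai` and an OPENAI_API_KEY environment variable;
  # the image URL below is a hypothetical placeholder.
  from openai import OpenAI

  client = OpenAI()

  response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {
              "role": "user",
              "content": [
                  {"type": "text",
                   "text": "What product is this, and does it look damaged?"},
                  {"type": "image_url",
                   "image_url": {"url": "https://example.com/returned-item.jpg"}},
              ],
          }
      ],
  )

  print(response.choices[0].message.content)

The same chat interface accepts text and image inputs side by side, which is what makes workflows like returns inspection possible without a separate vision pipeline.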

With AI models that can interpret both images and audio, our understanding of how LLMs can serve us is rapidly expanding.

7. Multimodality means accessibility


We've already seen enhanced multimodal capabilities used for social good. A perfect example is OpenAI's partnership with Be My Eyes, a Danish startup that connects blind and low-vision users with sighted volunteers. When users need help, like choosing the right canned product at the supermarket or identifying the color of a t-shirt, the app connects them via video on their smartphone with sighted volunteers around the world.

OpenAI's new vision capabilities could provide an even more rewarding experience for Be My Eyes users. Instead of relying on human volunteers to decipher images or video in real time, blind users can simply feed images or video to their devices, and the model responds with audio information.

OpenAI and Be My Eyes, now trusted partners, are paving the way to independence for blind people around the world. Be My Eyes CEO Michael Buckley explains the impact:

"In the short time we've had access, we've seen performance that is unmatched by any image-to-text object recognition tool out there. The implications for global accessibility are huge. In the not-too-distant future, the blind and low vision community will be using this tool not only for a variety of visual interpretation needs, but also to have a greater degree of independence in their lives."

How will we judge LLMs in the future?


As competitors continue to race to be the cheapest and fastest - to create the cheapest and fastest LLM - the question becomes: how will we judge AI models in the future?

At some point in the future, the major LLM makers (likely OpenAI and Google) will reach a plateau in terms of how fast their models can run and how cheaply they can provide access. Once we reach a plateau in terms of cost and speed, how will we crown the market-leading model?

What will be the sign of the new times? Whether it’s the personalities available from your AI model, the video enhancement capabilities, the features available to free users, or new metrics beyond our current understanding, the next generation of LLMs is upon us...

Which applications use GPT-4o?


Now that we've covered what GPT-4o is and how it works, let's look at where it's used. Here are some of the applications and services from OpenAI where you can access GPT-4o.

1. ChatGPT Free


First, GPT-4o is available in the free version of ChatGPT, so you can use the model at no cost. However, free access comes with limits on advanced features such as image recognition, file uploads, and data analysis.

2. ChatGPT Plus


Next is ChatGPT Plus, the paid version of ChatGPT. Subscribers get full access to GPT-4o without the feature restrictions that apply to free users.

3. API Access


Developers can access GPT-4o through the API provided by OpenAI. This lets you integrate GPT-4o into your application and take advantage of all its features for a variety of tasks.
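
As a hedged example of what that looks like in practice, a minimal text-only request with the official OpenAI Python SDK might be as simple as this (the prompt is purely illustrative):

  # Minimal sketch: a text-only GPT-4o request via OpenAI's Python SDK.
  # Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
  from openai import OpenAI

  client = OpenAI()

  response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {"role": "system", "content": "You are a concise assistant."},
          {"role": "user", "content": "Summarize GPT-4o in one sentence."},
      ],
  )

  print(response.choices[0].message.content)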

4. Desktop Applications


OpenAI has also integrated GPT-4o into desktop applications, including a new macOS app launched in May 2024. This lets you access the model directly from your computer or laptop through an installed application.

5. Custom GPTs


In addition, OpenAI offers Custom GPTs built on GPT-4o through the GPT Store. Organizations and companies can develop tailored versions of GPT-4o to suit their specific business needs.

6. Microsoft OpenAI Service


Finally, GPT-4o is integrated into the Microsoft Azure OpenAI Service, where users can access it through Azure OpenAI Studio. However, access to this service is still fairly limited.

What are the limitations of GPT-4o?


Before turning to the limitations, it's worth distinguishing the two variants of the model. The main differences between GPT-4o mini and the standard GPT-4o are size and cost: GPT-4o mini is a smaller, more cost-effective version, while standard GPT-4o is more expensive but performs better on benchmarks such as MMLU.

While GPT-4o mini also supports text and vision, standard GPT-4o has broader support for input and output types. GPT-4o mini is ideal for low-cost, low-latency applications such as customer support chatbots, while standard GPT-4o is better suited to applications that require high performance and deep analysis (see the model-selection sketch below).
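
As a sketch of that trade-off, an application might route requests between the two variants based on how demanding the task is. The routing rule below is purely an illustrative assumption, while "gpt-4o" and "gpt-4o-mini" are the published API model names:

  # Sketch: choosing between GPT-4o and GPT-4o mini per request.
  # Assumes the official OpenAI Python SDK and an OPENAI_API_KEY variable;
  # the routing heuristic is a hypothetical example, not a recommendation.
  from openai import OpenAI

  client = OpenAI()

  def answer(question: str, needs_deep_analysis: bool) -> str:
      # Cheaper, lower-latency model for routine support; larger model otherwise.
      model = "gpt-4o" if needs_deep_analysis else "gpt-4o-mini"
      response = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": question}],
      )
      return response.choices[0].message.content

  print(answer("Where can I find my order history?", needs_deep_analysis=False))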

Although claimed to be more sophisticated than previous versions of GPT, GPT-4o still has several limitations and risks, including the following.

1. Inaccurate Output


The model's ability to interpret images and video is still limited by resolution and complexity, so very detailed or high-resolution images may not be processed as accurately as simpler ones. Likewise, audio transcription is rarely 100% accurate, especially if the speaker has a strong accent or uses specialized technical vocabulary.

2. Security Issues


The sophistication of GPT-4o can also increase the risk of fraudulent calls and deepfakes, since video and audio can be manipulated to create fake content that less vigilant people may mistake for the real thing.

3. Limitations in Contextual Understanding


Despite a relatively large 128K-token context window, GPT-4o can still have difficulty maintaining coherence in very long conversations or documents. This can lead to inconsistencies or contradictions in its output.


Conclusion


GPT-4o can respond to audio input in as little as 232 milliseconds. It matches the performance of GPT-4 Turbo on English text and code, with significant improvements on non-English text, and is said to be better at understanding vision and audio than existing models. Previous models tended to pipeline speech and text, transcribing audio to text, processing text to text, or converting text to audio.

With GPT-4o, OpenAI trained a new model end-to-end across text, vision, and audio, with all inputs and outputs processed by the same neural network. Because GPT-4o is the first model to combine all of these modalities, OpenAI is still exploring its capabilities and limitations.

That covers the essentials of GPT-4o, the latest AI model released by OpenAI. Ultimately, its improvements are inseparable from machine learning itself: the system's performance keeps improving as it learns from the data available to it.