Comparing GPT-4o, GPT-4, and Gemini 1.5: A Performance Analysis

OpenAI’s recent unveiling of GPT-4o marks a significant milestone in the evolution of AI language models and their interaction capabilities.

One of the standout features is its support for live interaction with ChatGPT, including the ability to handle conversational interruptions seamlessly.

Despite a few glitches during the live demo, I was thoroughly impressed by what the team has achieved.

The excitement continued as OpenAI immediately granted access to the GPT-4o API post-demo.

In this article, I will share my independent analysis comparing the classification abilities of GPT-4o, GPT-4, and Google's Gemini and PaLM 2 Unicorn models on an English dataset I developed.

Which of these models excels in understanding English? Let’s find out.

What’s New with GPT-4o?

At the forefront is the new "omni" model (the "o" in GPT-4o), designed to understand and process text, audio, and video seamlessly.

OpenAI aims to democratize GPT-4 level intelligence, making it accessible to a broader audience, including free users.

GPT-4o boasts enhanced quality and speed across more than 50 languages, offering a more inclusive and globally accessible AI experience at a lower cost.

Paid subscribers will benefit from five times the capacity compared to non-paid users.

Additionally, OpenAI announced the release of a desktop version of ChatGPT, enabling real-time reasoning across audio, vision, and text interfaces for everyone.


Using the GPT-4o API

The new GPT-4o model is compatible with OpenAI's existing chat-completion API, ensuring it's easy to use and integrates seamlessly with previous setups.

```python
from openai import OpenAI
from openai.types.chat import ChatCompletion

OPENAI_API_KEY = "<your-api-key>"

def openai_chat_resolve(response: ChatCompletion, strip_tokens=None) -> str:
    # Extract the assistant's reply from the completion and remove unwanted tokens.
    strip_tokens = strip_tokens or []
    if response and response.choices:
        content = response.choices[0].message.content.strip()
        if content:
            for token in strip_tokens:
                content = content.replace(token, '')
            return content
    raise Exception(f"Cannot resolve response: {response}")

def openai_chat_request(prompt: str, model_name: str, temperature=0.0) -> ChatCompletion:
    # Send a single-turn chat-completion request to the given model.
    message = {'role': 'user', 'content': prompt}
    client = OpenAI(api_key=OPENAI_API_KEY)
    return client.chat.completions.create(
        model=model_name,
        messages=[message],
        temperature=temperature,
    )

response = openai_chat_request(prompt="Hello!", model_name="gpt-4o-2024-05-13")
answer = openai_chat_resolve(response)
print(answer)
```


GPT-4o is also available through the ChatGPT interface.



Official Evaluation

OpenAI's blog post presents evaluation scores from well-known datasets like MMLU and HumanEval.



Based on those reported scores, GPT-4o's performance stands out as state-of-the-art in this domain, which is promising given its affordability and speed.

However, over the past year, several models have claimed state-of-the-art language performance across established datasets.

In reality, some models have been partially trained on, or overfitted to, these datasets, leading to inflated leaderboard scores. For further insights, refer to this paper.

Therefore, it's crucial to conduct independent analyses of these models' performance using lesser-known datasets, such as the one I've developed. 😄



My Evaluation Dataset 🔢

As detailed in previous articles, I've curated a specialized dataset to assess classification performance across various LLMs.

This dataset comprises 200 sentences categorized into 50 topics, intentionally designed to make the classification task challenging.

I meticulously created and labeled the entire dataset in English.

Subsequently, I employed GPT-4 (model: gpt-4-0613) to translate the dataset into multiple languages.

However, for this evaluation we'll focus solely on the English version of the dataset; this avoids any bias that could arise from using the same language model both for dataset creation (translation) and for topic prediction.
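
For illustration, the English evaluation data can be thought of as a list of 50 topic names plus 200 labeled (sentence, topic) pairs. The variable names and example entries below are hypothetical and only show the assumed structure, not actual rows from the dataset:

```python
# Hypothetical layout of the evaluation data; the topic names and the example
# sentence are illustrative placeholders, not actual rows from the dataset.
topics: list[str] = [
    "space exploration",
    "personal finance",
    # ... 48 more topic names (50 in total)
]

dataset: list[tuple[str, str]] = [
    ("The rover's drill finally reached the layered bedrock.", "space exploration"),
    # ... 199 more (sentence, gold topic) pairs (200 in total)
]
```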



Performance Results 📊

I conducted evaluations on the following models:


  • GPT-4o: gpt-4o-2024-05-13

  • GPT-4: gpt-4-0613

  • GPT-4-Turbo: gpt-4-turbo-2024-04-09

  • Gemini 1.5 Pro: gemini-1.5-pro-preview-0409

  • Gemini 1.0: gemini-1.0-pro-002

  • PaLM 2 Unicorn: text-unicorn@001
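
The OpenAI models can be queried with the helper functions shown earlier. For the Gemini and PaLM 2 Unicorn models, the sketch below shows one possible way to issue equivalent requests, assuming the Vertex AI Python SDK (vertexai) and a configured Google Cloud project; the exact setup used for these experiments may differ:

```python
# A sketch of querying the Google models through the Vertex AI Python SDK.
# The project ID and region are placeholders; adjust them to your own setup.
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextGenerationModel

vertexai.init(project="<your-gcp-project>", location="us-central1")

def gemini_request(prompt: str, model_name: str, temperature: float = 0.0) -> str:
    # Works for Gemini model IDs such as gemini-1.5-pro-preview-0409 and gemini-1.0-pro-002.
    model = GenerativeModel(model_name)
    response = model.generate_content(prompt, generation_config={"temperature": temperature})
    return response.text.strip()

def unicorn_request(prompt: str, temperature: float = 0.0) -> str:
    # PaLM 2 Unicorn uses the older text-generation interface.
    model = TextGenerationModel.from_pretrained("text-unicorn@001")
    response = model.predict(prompt, temperature=temperature)
    return response.text.strip()

print(gemini_request("Hello!", "gemini-1.5-pro-preview-0409"))
```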

The task assigned to the language models was to match each sentence in the dataset with the correct topic, which made it possible to compute an accuracy score, and from it an error rate, for each model.

Since the models mostly classified correctly, I plotted the error rate for each model. Remember, a lower error rate indicates better performance.
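
To make the procedure concrete, here is a simplified sketch of the classification loop for the OpenAI models, reusing the openai_chat_request and openai_chat_resolve helpers from earlier and the topics and dataset structures sketched above; the prompt wording is illustrative rather than the exact prompt used:

```python
# A simplified sketch of the evaluation loop; the prompt wording is illustrative.
def classify_sentence(sentence: str, model_name: str) -> str:
    prompt = (
        "Classify the following sentence into exactly one of these topics.\n"
        f"Topics: {', '.join(topics)}\n"
        f"Sentence: {sentence}\n"
        "Reply with the topic name only."
    )
    response = openai_chat_request(prompt=prompt, model_name=model_name)
    return openai_chat_resolve(response)

def error_rate(model_name: str) -> float:
    # Fraction of sentences whose predicted topic does not match the gold label.
    mistakes = sum(
        1
        for sentence, true_topic in dataset
        if classify_sentence(sentence, model_name).lower() != true_topic.lower()
    )
    return mistakes / len(dataset)

print(f"gpt-4o-2024-05-13 error rate: {error_rate('gpt-4o-2024-05-13'):.1%}")
```

The same loop can be pointed at the Gemini or Unicorn helpers by swapping out the request call.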



From the plot, GPT-4o has the lowest error rate of all the models, making only two mistakes. PaLM 2 Unicorn, GPT-4, and Gemini 1.5 follow closely behind, also demonstrating strong performance.

Interestingly, GPT-4 Turbo performed similarly to GPT-4-0613. For more information on these models, refer to OpenAI's model page.

Lastly, Gemini 1.0 exhibited lower performance, which aligns with its price range.



Exploring Multilingual Capabilities

In a recent article, I delved into the multilingual prowess of GPT-4o compared to other LLMs like Claude Opus and Gemini 1.5.


In another piece, Unraveling Context: A Deep Dive into How GPT-4o and Gemini 1.5 Retain Context Through the 'Needle in the Haystack' Framework, I examined how well these models retain long-context information.



Conclusion 💡

This analysis, based on a carefully curated English dataset, offers valuable insight into the current capabilities of these advanced language models.

GPT-4o, OpenAI's latest model, distinguishes itself with the lowest error rate among the models examined, supporting OpenAI's claims about its performance.

The AI community and users alike should continue to run independent evaluations on diverse datasets, as these give a more complete picture of a model's practical effectiveness than standardized benchmarks alone.

It's worth noting that the dataset used in this analysis is relatively small, and results may vary with different datasets. Furthermore, the performance evaluation was conducted solely with the English dataset, while a multilingual comparison remains a prospect for future examination.