Unveiling Gemini Ultra: A Comparative Analysis with GPT-4 Reveals Google's Ongoing Pursuit for the Perfect Recipe


Summing Up:

Google's recent launch of the Ultra 1.0 model, which powers Gemini Advanced, prompted this head-to-head comparison with GPT-4.

Gemini Advanced lags behind GPT-4 in commonsense reasoning, lacking the intuitive edge GPT-4 shows on logical queries.

While Gemini Ultra excels in creative writing and shows improved coding performance, true multimodal (image) understanding remains absent from both the Ultra and Pro models.



After introducing the Gemini family of models about two months ago, Google has now rolled out its largest and most capable model yet, Ultra 1.0, inside the Gemini chatbot (formerly known as Bard). Positioned as the next step in the Gemini era, it is Google's bid to surpass OpenAI's widely used GPT-4 model, which debuted nearly a year ago. Today, we run a comprehensive comparison between Gemini Ultra and GPT-4, examining commonsense reasoning, coding proficiency, multimodal capabilities, and more. So, let's dive into the Gemini Ultra vs GPT-4 showdown.





Important Reminder:
In this comparison, we're juxtaposing OpenAI's GPT-4 with Google's Gemini Ultra 1.0 model, accessible through the subscription-based Gemini Advanced service.



1. The Apple Challenge:

In our initial logical reasoning assessment, dubbed the "Apple test," Gemini Ultra falls short against GPT-4. The question is a simple trap: the apple eaten yesterday does not change today's count, so the correct answer is three. Despite Google's assertions about the Ultra model's superior capabilities within the Gemini Advanced subscription, Gemini Ultra stumbles on this basic commonsense reasoning query.

Scenario: 

I had 3 apples today, and I ate one yesterday. How many apples do I have now?


Winner: GPT-4




2. Weighing the Options:

In a second reasoning assessment, Google's Gemini once again lags behind GPT-4, a rather disappointing outcome. Gemini Ultra asserts that 1,000 bricks weigh the same as 1,000 feathers, which is clearly untrue: piece for piece, a brick is far heavier than a feather. Another win for GPT-4.


Scenario: 

Which weighs more, 1000 pieces of bricks or 1000 pieces of feathers?


Winner: GPT-4




3. The Apple Finale:

In our subsequent evaluation pitting Gemini against GPT-4, we tasked both LLMs with crafting 10 sentences concluding with the word "Apple".

Of the 10 sentences requested, GPT-4 correctly ended eight with the word "apple", a solid showing. Gemini, in stark contrast, managed only three, a notable shortfall for Gemini Ultra. Despite Google's claims of careful instruction-following, Gemini falters in this simple real-world check.

Task: 

Generate 10 sentences that end with the word 'apple'.


Winner: GPT-4
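
Scoring this kind of instruction-following test by hand gets tedious, so here is a minimal sketch of how such outputs could be checked automatically (a hypothetical helper, assuming each model's response is pasted in as a newline-separated string; the sample text below is a placeholder, not either model's actual output):

import string

def count_valid_sentences(response: str, target: str = "apple") -> int:
    # Count how many lines end with the target word, ignoring trailing punctuation and case.
    valid = 0
    for line in response.strip().splitlines():
        words = line.strip().rstrip(string.punctuation).split()
        if words and words[-1].lower() == target:
            valid += 1
    return valid

# Placeholder output for illustration only.
sample = "She handed me a shiny red apple.\nThe pie was made with cinnamon and sugar."
print(count_valid_sentences(sample))  # -> 1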




4. Deciphering the Sequence:

We challenged both cutting-edge models from Google and OpenAI to discern a pattern and predict the next term. In this test, Gemini Ultra 1.0 correctly identified the pattern but stumbled when giving the final answer. GPT-4, by contrast, grasped the pattern and delivered the correct solution.

It seems that, despite its advancements, Gemini Advanced, powered by the new Ultra 1.0 model, still falls short on careful analytical reasoning. GPT-4's replies may read as drier, but they are consistently accurate.


Sequence: 

July, August, October, January, May, ?


Winner: GPT-4
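
For reference, the gap between consecutive months grows by one each step (+1, +2, +3, +4 months), so the next term is five months after May, which is October. A minimal sketch of that arithmetic:

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

sequence = ["July", "August", "October", "January", "May"]
next_step = len(sequence)          # steps so far were +1, +2, +3, +4, so the next is +5
last_index = months.index(sequence[-1])
print(months[(last_index + next_step) % 12])  # -> October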





5. Navigating the Haystack Challenge:

The Needle in a Haystack challenge, devised by Greg Kamradt, has emerged as a popular benchmark for assessing accuracy, particularly with LLMs operating within expansive context lengths. This challenge evaluates a model's ability to recall and extract a specific statement (the "needle") from an extensive body of text.

In this instance, I presented both models with a sample text exceeding 3,000 tokens and spanning 14,000 characters, tasking them to locate the specified statement within the text.

Regrettably, Gemini Ultra failed to process the text at all, whereas GPT-4 retrieved the statement effortlessly while also noting that it did not recognize the surrounding narrative. Despite both models advertising a 32,000-token context window, Google's Ultra 1.0 model could not complete the task.
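
For readers who want to reproduce this, the setup is simple to assemble: bury one out-of-place statement inside a long stretch of filler text and ask the model to quote it back. A minimal sketch with a placeholder needle and filler of my own, not the exact passage used in the test above:

import random

needle = "The secret passphrase for this exercise is 'blue giraffe'."
filler = "The afternoon light settled quietly over the hills as the town went about its day. "

# Build a long haystack and hide the needle at a random position inside it.
sentences = [filler] * 400
sentences.insert(random.randint(0, len(sentences)), needle + " ")
haystack = "".join(sentences)

prompt = (
    "Below is a long passage. One sentence in it does not belong. "
    "Quote that sentence exactly.\n\n" + haystack
)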


Winner: GPT-4




6. Cracking the Code:

In this coding assessment, I tasked both Gemini and GPT-4 with modifying code to make a Gradio interface public, and both provided the correct solution (a sketch of the fix follows the prompt below). Notably, when I previously ran the same prompt on Bard, powered by the PaLM 2 model, it gave an incorrect answer, so this marks a clear improvement in Gemini's coding proficiency. Even the free version of Gemini, driven by the Pro model, got it right.


Prompt: 

I want to make this Gradio interface public. What change should I make in the code provided below?


iface = gr.Interface(fn=chatbot,
                     inputs=gr.components.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")

index = construct_index("docs")
iface.launch()
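
The change itself is Gradio's built-in sharing flag: calling launch(share=True) makes Gradio generate a temporary public URL for the interface. A minimal sketch of the adjusted snippet, assuming the chatbot function and construct_index helper defined elsewhere in the project:

import gradio as gr

# chatbot and construct_index are assumed to be defined elsewhere in the project.
iface = gr.Interface(fn=chatbot,
                     inputs=gr.components.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")

index = construct_index("docs")
iface.launch(share=True)  # share=True exposes a temporary public link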


Winner: Tie






7. Crunching the Numbers:

In a lighthearted math challenge, I presented both LLMs with a problem, and to my delight, both performed admirably. To ensure fairness, I requested that GPT-4 refrain from utilizing the Code Interpreter for mathematical computations, given that Gemini lacks a comparable tool at present.


Winner: Tie




8. Unleashing Creativity:

Gemini Ultra truly shines in creative writing, surpassing GPT-4 in this domain. Over the weekend, I tested the Ultra model extensively on creative tasks, and its performance was consistently remarkable. In contrast, GPT-4's responses tend to carry a colder, more robotic tone.

Notably, Ethan Mollick has echoed similar sentiments in his comparison of both models.

For those seeking an AI model adept at creative writing, Gemini Ultra is a standout choice. Paired with up-to-date results from Google Search, Gemini becomes an excellent tool for researching and drafting content on almost any subject.


Winner: Gemini Ultra





9. Crafting Visuals:

Both models offer image generation, GPT-4 via DALL-E 3 and Gemini via Imagen 2, and in terms of raw image quality OpenAI's model outshines Google's. When it comes to faithfully following instructions, however, DALL-E 3 (integrated with GPT-4 in ChatGPT Plus) falls short and occasionally hallucinates, whereas Imagen 2 (integrated with Gemini Advanced) follows the prompt exactly, with no hallucination. On this test, Gemini surpasses GPT-4.


Prompt: 

Create a picture of an empty room with no elephant in it. Absolutely no elephant anywhere in the room.


Winner: Gemini Ultra




10. Crack the Movie Code:

When Google unveiled the Gemini models two months ago, its demo showcased several intriguing capabilities, including multimodal understanding: comprehending multiple images and discerning the connection between them. However, when I uploaded an image from that video, Gemini failed to guess the movie, whereas GPT-4 succeeded on its first attempt.

A Google employee, on X (formerly Twitter), confirmed that the multimodal capability remains inactive for both Gemini Advanced (powered by the Ultra model) and Gemini (powered by the Pro model). Consequently, image queries do not currently utilize the multimodal models.

This clarifies Gemini Advanced's performance in this test. To conduct a genuine multimodal comparison between Gemini Advanced and GPT-4, we must await the addition of this feature by Google.


Prompt: 

Given the play on words of these images, guess the name of the movie.


Winner: GPT-4


Thanks for reading