My Experience with Gemini 1.5 Pro: Surpassing GPT-4 and Gemini 1.0 Ultra


In brief,

My experience with Gemini 1.5 Pro on Google AI Studio exceeded expectations; it significantly outperformed previous Google models.

Its 1 million token context window is revolutionary, far exceeding even GPT-4 Turbo's 128K token limit.

Moreover, its native multimodal capabilities seamlessly handle videos, images, and diverse file formats, as demonstrated in my tests.



Google unveiled the latest iteration of the Gemini model, Gemini 1.5 Pro, a fortnight ago. Today, I gained access to its eagerly awaited 1 million token context window. Consequently, I set aside all my tasks, notified my Editor of my testing endeavor with the new Gemini model, and delved into the evaluation.

But before I delve into the comparison results between Gemini 1.5 Pro, GPT-4, and Gemini 1.0 Ultra, let's first review the fundamental aspects of the new Gemini 1.5 Pro model.



Introducing the Gemini 1.5 Pro AI Model:

After months of anticipation, Google's unveiling of the Gemini 1.5 Pro model marks a significant leap in multimodal large language models (LLMs). Departing from the conventional dense architecture used in the Gemini 1.0 series, Gemini 1.5 Pro adopts a sophisticated Mixture-of-Experts (MoE) architecture.

Notably, this is the same MoE approach that OpenAI's flagship GPT-4 model is widely reported to use, though OpenAI has never confirmed it.

But the innovation doesn't stop there; boasting a staggering context length of 1 million tokens, the Gemini 1.5 Pro surpasses the 128K token limit of GPT-4 Turbo and the 200K token constraint of Claude 2.1. Google's internal testing, which pushed the model to handle up to 10 million tokens, underscores its impressive data ingestion capabilities, showcasing remarkable retrieval prowess.

Furthermore, despite its smaller size compared to the largest Gemini 1.0 Ultra model (accessible via Gemini Advanced), Google asserts that the Gemini 1.5 Pro model delivers comparable performance. Now, with all these bold assertions laid out, shall we proceed to put them to the test?



Gemini 1.5 Pro vs Gemini 1.0 Ultra vs GPT-4 Comparison


1. The Apple Test:

In my previous assessment comparing Gemini 1.0 Ultra with GPT-4, Google fell short against OpenAI on the standard Apple test, which evaluates the logical reasoning capabilities of large language models (LLMs). The new Gemini 1.5 Pro model, however, answers the test question correctly, a clear sign of Google's progress in sophisticated reasoning.

This marks Google's resurgence in the competition! As before, GPT-4 also answered correctly, while Gemini 1.0 Ultra persisted with the wrong answer of 2 apples remaining (the correct answer is 3, since the apple eaten yesterday doesn't affect today's count).


Today, I woke up to a total of 3 apples in my possession. Yesterday, I indulged in an apple. Now, you might wonder, how many apples remain in my stash?


Winner: Gemini 1.5 Pro and GPT-4


2. The Towel Question:

During another test aimed at assessing the advanced reasoning prowess of Gemini 1.5 Pro, I presented the well-known towel question. Disappointingly, all three models—Gemini 1.5 Pro, Gemini 1.0 Ultra, and GPT-4—missed the mark.

None of these AI models grasped the fundamental essence of the question: the towels dry in parallel, so 20 towels spread out under the same sun would still take the same one hour. Instead, all three resorted to proportional arithmetic and arrived at incorrect conclusions. It's evident that AI models still have a considerable distance to cover before they can reason on par with humans.


Let's ponder this scenario: If it takes an hour to dry 15 towels under the blazing sun, how much time would it require to dry 20 towels under the same conditions?


Winner: None


3. Which is Heavier:

Continuing my assessment, I conducted a tailored version of the weight evaluation test to assess the intricate reasoning skills of Gemini 1.5 Pro. To my satisfaction, both Gemini 1.5 Pro and GPT-4 performed admirably, passing the test with flying colors. However, Gemini 1.0 Ultra once again stumbled and failed to meet the mark.

Both Gemini 1.5 Pro and GPT-4 correctly focused on the units rather than the materials, noting that a kilogram (roughly 2.2 pounds) of any substance, feathers included, will always outweigh a pound of steel or anything else. Kudos to Google for this impressive showing!

Here's a classic conundrum for you: Which weighs more, a kilo of feathers or a pound of steel?


Winner: Gemini 1.5 Pro and GPT-4



4. Solve a Maths Problem:

Thanks to Maxime Labonne's generosity, I got my hands on one of his math prompts to gauge Gemini 1.5 Pro's mathematical acumen. And let me tell you, Gemini 1.5 Pro aced the test effortlessly.

I also put GPT-4 through the same trial, and it too nailed the correct answer. But let's face it, we already know GPT is no slouch in this department. Oh, and just for the record, I made it clear to GPT-4 not to rely on the Code Interpreter plugin for mathematical computations. And as expected, Gemini 1.0 Ultra fell short and provided an incorrect output. I mean, why am I even bothering with Ultra in this test? sighs Let's move on to the next prompt, shall we?


Here's a little brain teaser for you: If \( x \) and \( y \) represent the tens and units digits, respectively, of the product \( 725,278 \times 67,066 \), what's the sum of \( x \) and \( y \)? And hey, can you think of the simplest way to figure this out without actually crunching through the entire number?
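For the curious, the shortcut both models reached for boils down to modular arithmetic: only the last two digits of each factor affect the last two digits of the product, so you never need the full result. A minimal Python sketch, purely illustrative and not part of the original prompt:

    # Only the last two digits of each factor determine the last two digits
    # of the product, so work modulo 100 instead of the full multiplication.
    a, b = 725_278, 67_066

    last_two = (a % 100) * (b % 100) % 100   # 78 * 66 = 5148 -> 48
    x, y = divmod(last_two, 10)              # tens digit x = 4, units digit y = 8
    print(x + y)                             # 12

    # Sanity check against the full product.
    assert (a * b) % 100 == last_two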


Winner: Gemini 1.5 Pro and GPT-4



5. Follow User Instructions:

Continuing our assessment, we transitioned to another test focusing on the ability of Gemini 1.5 Pro to accurately adhere to user instructions. We tasked it with generating 10 sentences, each concluding with the word "apple".

Unfortunately, Gemini 1.5 Pro fell short, managing to produce only three sentences meeting the criterion. In stark contrast, GPT-4 excelled, delivering nine sentences that fulfilled the requirement. As for Gemini 1.0 Ultra, it mustered only two sentences ending with the word "apple".


Please craft 10 sentences, ensuring each concludes with the word "apple".
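Scoring this test is straightforward to automate. A tiny checker like the one below, assuming each model's output is pasted in with one sentence per line, is all it takes (a hypothetical helper I'm including for illustration, not something the models saw):

    def count_apple_endings(output: str) -> int:
        # Count how many lines end with the word "apple", ignoring punctuation.
        count = 0
        for line in output.strip().splitlines():
            words = line.strip().rstrip(".!?\"'").split()
            if words and words[-1].lower() == "apple":
                count += 1
        return count

    # Example: paste a model's 10 sentences into `sentences` and check the score.
    sentences = "She bit into a crisp green apple.\nHe prefers oranges instead."
    print(count_apple_endings(sentences))  # 1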


Winner: GPT-4



6. Needle in a Haystack (NIAH) Test:

At the heart of Gemini 1.5 Pro lies its standout feature: the ability to handle an extensive context length of 1 million tokens. Google's own needle-in-a-haystack (NIAH) evaluation reported an impressive 99% retrieval rate with remarkable precision. Naturally, I felt compelled to conduct a similar assessment.

I selected one of the lengthiest Wikipedia articles, detailing the Spanish Conquest of Petén, running nearly 100,000 characters and consuming around 24,000 tokens. To challenge the AI models further, I inserted a proverbial needle, a random out-of-place statement, into the middle of the text, a placement that researchers have found degrades retrieval performance in long-context scenarios.
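For context, the setup looked roughly like the sketch below; the file name and the needle sentence are placeholders rather than the exact ones I used:

    # Build a needle-in-a-haystack prompt from a long article.
    article = open("spanish_conquest_of_peten.txt", encoding="utf-8").read()

    # The "needle": one out-of-place sentence buried in the middle of the text,
    # the position where long-context retrieval tends to be weakest.
    needle = "My favourite breakfast is a toasted bagel with blueberry jam."
    midpoint = len(article) // 2
    haystack = article[:midpoint] + "\n" + needle + "\n" + article[midpoint:]

    prompt = (
        haystack
        + "\n\nSomewhere in the text above is a sentence that has nothing to do "
          "with the Spanish Conquest of Petén. Quote that sentence exactly."
    )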

Gemini 1.5 Pro rose to the occasion, flawlessly identifying the statement with exceptional accuracy and contextual understanding. Conversely, GPT-4 faltered in locating the needle within the expansive text window. As for Gemini 1.0 Ultra, although purportedly supporting a 32K-context length, it currently operates with an 8K token limit through Gemini Advanced. Even with this reduced context, Gemini 1.0 Ultra still failed to pinpoint the targeted text statement.

In long-context retrieval, Gemini 1.5 Pro reigns supreme, comfortably ahead of the other models I tested.


Winner: Gemini 1.5 Pro



7. Multimodal Video Test:

Although GPT-4 boasts multimodal capabilities, it currently lacks the ability to process videos. Similarly, Gemini 1.0 Ultra is equipped with multimodal features, but Google has yet to activate video processing for this model, limiting its functionality within Gemini Advanced.

However, the game changes with Gemini 1.5 Pro, which I'm currently utilizing through Google AI Studio. This dynamic model not only allows for the upload of videos but also accommodates various file types, images, and even folders containing diverse file formats. Intrigued by its capabilities, I uploaded a 5-minute Beebom video, showcasing the OnePlus Watch 2 review—a content piece absent from the model's training data.

Remarkably, the model efficiently processed the video within a minute, utilizing a fraction of its extensive token capacity. Subsequently, I posed inquiries to Gemini 1.5 Pro regarding the video's content and features of the watch. Impressively, it promptly responded to each query with precision and coherence, even providing detailed information such as the reviewer's location and the color of the watch band.

Further highlighting its prowess, Gemini 1.5 Pro swiftly generated a transcript of the video upon request, demonstrating its adeptness at analyzing visual data and deriving meaningful insights.
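I ran all of this through the AI Studio web interface, but for anyone who would rather script it, the Gemini API's Python SDK (the google-generativeai package) follows roughly this flow; the file path, model name string, and question are placeholders of my own, not taken from my actual session:

    import time
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Upload the video via the File API and wait for server-side processing.
    video = genai.upload_file(path="oneplus_watch_2_review.mp4")
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)

    model = genai.GenerativeModel("gemini-1.5-pro-latest")
    response = model.generate_content(
        [video, "What color is the watch band shown in this review?"]
    )
    print(response.text)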

In essence, Gemini 1.5 Pro emerges as a formidable multimodal model, surpassing existing standards. As Simon Willison aptly describes in his blog, video functionality elevates Gemini 1.5 Pro to unparalleled heights, making it a true game-changer in the field.


Winner: Gemini 1.5 Pro



8. Multimodal Image Test:

For my final evaluation, I decided to put Gemini 1.5 Pro's visual prowess to the test. I uploaded a still image extracted from Google's demo video showcased during the Gemini 1.0 launch. In a previous trial, Gemini 1.0 Ultra had stumbled in the image analysis assessment because Google had withheld the multimodal feature from the Ultra model within Gemini Advanced.

However, the response from Gemini 1.5 Pro was swift and accurate as it correctly identified the movie title as "The Breakfast Club". Similarly, GPT-4 provided the correct answer. In contrast, Gemini 1.0 Ultra struggled to process the image, attributing its failure to the presence of people's faces, a puzzling assertion considering the absence of any faces in the image.
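For completeness, still images are even simpler to script with the same SDK: they can be passed inline alongside the prompt, without the File API upload used for video (the file name below is a stand-in for my screenshot):

    import PIL.Image
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro-latest")

    frame = PIL.Image.open("gemini_demo_frame.png")
    response = model.generate_content([frame, "Which movie is this scene from?"])
    print(response.text)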


Winner: Gemini 1.5 Pro and GPT-4



Insider's Verdict: Google's Breakthrough with Gemini 1.5 Pro

After spending the entire day exploring Gemini 1.5 Pro, it's safe to say that Google has delivered a game-changing innovation. The tech giant has introduced a remarkably robust multimodal model based on the MoE architecture, rivaling the capabilities of OpenAI's GPT-4 model.

Gemini 1.5 Pro shines across the board, matching GPT-4 in commonsense reasoning and outperforming it in long-context retrieval, multimodal functionality, video processing, and support for diverse file formats. And let's not forget, this assessment is based on the mid-size Gemini 1.5 Pro model; the forthcoming Gemini 1.5 Ultra model is expected to raise the bar even higher.

Currently in preview, Gemini 1.5 Pro is exclusively accessible to developers and researchers for testing and evaluation. However, prior to its broader release via Gemini Advanced, Google may implement additional safeguards that could potentially impact the model's performance, although I remain optimistic this won't be the case.

It's worth noting that upon its public release, Gemini 1.5 Pro won't offer the expansive 1 million token context window. Instead, users will have access to a still impressive 128,000 token context length. Nonetheless, developers can leverage the extended context window to create innovative solutions for end-users.

In addition to the Gemini announcement, Google has also introduced a range of lightweight Gemma models under an open-source license. However, the company faced scrutiny following an AI image generation incident involving Gemini, which merits further investigation.

Now, I'm curious to hear your thoughts on Gemini 1.5 Pro's performance. Are you thrilled to see Google reclaiming its position in the AI arena and poised to challenge OpenAI, especially in light of OpenAI's recent unveiling of Sora, its AI text-to-video generation model? Share your insights in the comments below.