Gemini set out to beat ChatGPT
Google DeepMind has unveiled Gemini, a new AI model designed to compete with OpenAI’s ChatGPT. Both are examples of “generative AI”, in which the computer learns to detect patterns in its training data and uses them to generate new data (images, words, or other media); ChatGPT, specifically, is a large language model (LLM) that concentrates on producing text.
Just as ChatGPT is a conversational web app built on a neural network known as GPT (trained on massive volumes of text), Google offers a conversational web app called Bard, built on a model known as LaMDA (trained on dialogue). Google is now upgrading Bard to run on Gemini.
What distinguishes Gemini from generative AI models such as LaMDA is that it is a “multimodal model”. This means it directly handles several types of input and output: in addition to text, it supports images, audio, and video. As a result, a new abbreviation is emerging: LMM (large multimodal model), not to be confused with LLM.
The limitations
In September, OpenAI announced GPT-4 Vision (GPT-4V), a model that can work with images, audio, and text. It is not, however, fully multimodal in the sense that Gemini promises.
For example, while ChatGPT-4, which is powered by GPT-4V, can accept audio inputs and generate speech outputs, it cannot work with video inputs.
According to OpenAI, audio is handled by converting speech to text on input using a separate deep learning model called Whisper, and by converting text back to speech on output using yet another model. In other words, GPT-4V itself works solely with text.
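To make that wrap-around pipeline concrete, here is a minimal sketch using the OpenAI Python SDK. The model names (“whisper-1”, “gpt-4”, “tts-1”), the voice, and the file paths are illustrative assumptions; the article does not describe OpenAI’s internal wiring.

```python
# Hedged sketch: a speech-in / speech-out pipeline around a text-only LLM.
# Model names, voice, and paths are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech -> text with a separate ASR model (Whisper).
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text -> text with the core language model; it never sees raw audio.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. Text -> speech with a third model on the way out.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```

The key point is step 2: the core language model only ever exchanges text, with separate models bolted on at either side.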
Similarly, ChatGPT-4 can generate images, but it does so by passing text prompts to a second deep learning model, DALL-E 3, which turns written descriptions into images.
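The same hand-off pattern applies to images. A minimal sketch, again assuming the OpenAI Python SDK, with the model name, prompt, and size as illustrative placeholders:

```python
# Hedged sketch: the text-only LLM delegates image creation to DALL-E.
# Model name, prompt, and size are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```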
Google, by contrast, built Gemini to be “natively multimodal”. This means the core model itself can directly consume and produce a variety of formats (audio, images, video, and text).
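For comparison, here is a minimal sketch of passing mixed text-and-image input to a single model through Google’s google-generativeai Python SDK; the model name (“gemini-pro-vision”), the API key, and the image path are assumptions made for the example.

```python
# Hedged sketch: one natively multimodal model consuming text + image together,
# with no separate captioning or transcription model in between.
# Model name, API key, and file path are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-pro-vision")

# Text and image go into the same call; the model handles both directly.
response = model.generate_content(
    ["What is unusual about this picture?", Image.open("photo.jpg")]
)
print(response.text)
```

Unlike the ChatGPT pipeline above, there is no intermediate model in the loop: both modalities go into a single call.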
The Verdict
The distinction between these two approaches may sound academic, but it is crucial. According to Google’s technical report and other qualitative assessments, the current publicly available version of Gemini, named Gemini 1.0 Pro, is not as good as GPT-4 in general and is closer in capability to GPT-3.5.
Google also introduced Gemini 1.0 Ultra, a more powerful version of Gemini, and provided some figures suggesting it outperforms GPT-4. However, assessing this claim is problematic for two reasons. The first is that Google has not yet released Ultra, so its results cannot currently be independently verified.
The second reason it’s difficult to evaluate Google’s claims is that the company chose to release a somewhat misleading demonstration video. The video shows the Gemini model responding to a live video feed in an interactive, fluid manner.
However, as Bloomberg first reported, the demonstration in the video was not performed in real time. For example, the model had been taught some of the specific tasks in advance, such as the cup-and-ball trick, in which Gemini tracks which cup the ball is under. To do this, it was given a series of still images in which the presenter’s hands are shown swapping the cups.
Despite the arrival of numerous generative AI models over the last year, OpenAI’s GPT models have remained dominant, displaying levels of performance that other models have not been able to match.
Google’s Gemini marks the arrival of a major competitor that will help propel the field forward. OpenAI, of course, is almost certainly working on GPT-5, which will likely be multimodal and demonstrate remarkable new capabilities.