The end of the era ImageNet
30.09.2021

World-models are overtaking specialized and supervised models in performance and generalizability. With our new research we lift this development from the language space (with models like GPT-3) into a multimodal space, while retaining the impressive functionality known from large language models.

The supervised world for visual data

ImageNet was a tremendously influential dataset and publication. For the first time, a computer vision dataset was available that was big and diverse enough to unlock the potential of the larger deep networks then emerging. ImageNet paired each image with a label: one of more than 20,000 human-annotated categories describing the main subject of the image. This was the spark that ignited modern image classification and, later, object detection (when labels moved to bounding boxes and pixel masks). The period of innovation that followed was directly responsible for the success of my earlier start-up and for the boom in autonomous systems overall.

World knowledge and understanding cannot be (only) supervised

ImageNet was great at the time. From today's perspective, though, and especially compared with the capabilities of self-supervised language models like GPT-3, the limitation to seemingly arbitrary categories, defined by PhDs after maybe too much coffee, seems very restrictive. What about the reflection of a pedestrian in a shop window? Someone dressed up for carnival? Or a poster ad that shows a full-sized human?
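To make the restriction concrete, here is a minimal Python sketch of what a fixed label set does to such cases. The category names and the `encode_label` helper are hypothetical, made up for illustration, and are not taken from ImageNet. The point is only that anything outside the list has no representation at all.

```python
from typing import Sequence

# Hypothetical stand-in for a fixed, human-defined label set.
CATEGORIES = ("pedestrian", "car", "bicycle")


def encode_label(description: str, categories: Sequence[str] = CATEGORIES) -> int:
    """Map a description to a class index; fail for anything outside the set."""
    try:
        return categories.index(description)
    except ValueError:
        # A closed label set cannot express what it did not anticipate.
        raise KeyError(f"no category for {description!r}") from None
```

A call like `encode_label("pedestrian")` returns a class index, while `encode_label("reflection of a pedestrian in a window")` raises, because the observation simply does not exist in the label space.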

For humans, these observations are not a problem at all. We have all the world knowledge needed to understand these things and make sense of their implications. Putting this world knowledge into labels for a classifier to learn is hopeless (although we tried: one of the most noteworthy attempts by my team and me was a dataset with masks layered on top of each other, carrying properties like "semi-transparent", "light-effect" and "reflection").

HCI Benchmark Suite

(Our) multimodal world model

When I looked at the early results of our multimodal model, I knew that the era of ImageNet was coming to an end. If we are smart about combining a giant language model with one (or more) image representations, we can leverage the power and complexity of language.

The amazing thing about language is that it is specifically built to capture the complete relevant complexity of our world and map it to our understanding. This is why large language models are so powerful: they piggyback on millennia of human understanding encoded in a standardized form. For very large models, this seems to implicitly include abilities like answering detailed questions, writing summaries and comparing semantics and meaning.

If we succeed in combining this with visual understanding, we immediately gain all this power in the visual domain. Tell a story inspired by some images? Describe the difference between two observations? Answer a question about image content? Read text? Learn new visual concepts? All of these are suddenly possible with the almost unlimited expressive power of language. Why would you ever go back to labels?

I am sure there are good reasons to do so for certain use cases. However, looking at our first results, I believe the era of ImageNet is over. Our model outperforms BERT-based techniques for Outside Knowledge Visual Question Answering, as the model retains encyclopaedic knowledge from the language model. A more technical writeup by the research team is coming soon.

Thanks, ImageNet, for all the fish.

To inspire your ideas - our large multimodal world-model:

  • almost unlimited representation power of language and images
  • language and images can be combined freely, in any order and quantity (e.g. by adding several images in a context with implicit meaning in the sequential order)
  • all known large language model tricks work (like QA)
  • learns new visual concepts few-shot
  • better at reading than most dedicated OCR systems
  • can be combined with our other tools (WorldPointer, HybridInterface)
  • I can't wait to see what else our partners will discover
  • won't fit on your gaming GPU, sorry
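
The free mixing of language and images listed above can be pictured as a plain sequence of typed items. The sketch below is an assumption for illustration only: `Text`, `Image`, `build_few_shot_prompt` and `modalities` are hypothetical names, the file paths are placeholders, and nothing here is the actual interface of the model, which this post does not describe. It shows how a few-shot prompt could interleave example images with their answers before the unanswered query.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Text:
    content: str


@dataclass(frozen=True)
class Image:
    path: str  # hypothetical: a reference to an image file, not loaded here


# A prompt is any sequence of text and image items, in any order and quantity.
PromptItem = Union[Text, Image]


def build_few_shot_prompt(
    examples: List[Tuple[Image, str]],
    query: Image,
    question: str,
) -> List[PromptItem]:
    """Interleave (image, answer) examples, then append the unanswered query."""
    prompt: List[PromptItem] = []
    for image, answer in examples:
        prompt.append(image)
        prompt.append(Text(f"Q: {question} A: {answer}"))
    prompt.append(query)
    prompt.append(Text(f"Q: {question} A:"))  # left open for the model to complete
    return prompt


def modalities(prompt: List[PromptItem]) -> List[str]:
    """Return the modality of each item, in prompt order."""
    return [type(item).__name__ for item in prompt]
```

With one example and one query, the prompt alternates image, text, image, text, and the model would be expected to continue after the final `A:`. The order of items carries meaning, which is why the prompt is a list rather than a set of separate inputs.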

Some examples out of the lab, still WIP but you can already see the possibilities:

World-class OCR combined with context understanding (what is relevant - how would a human answer)

Learning and applying new concepts in the visual space based on one example

Reading and understanding graphs and diagrams

World knowledge included
