A principal applied scientist at Microsoft has “successfully” gotten OpenAI’s multimodal variant, GPT-4V, to play DOOM, resulting in the researcher issuing a warning about the implications of advanced AI-powered systems.

The scientist, Adrian de Wynter, is also a researcher at the University of York in England and recently published a research paper titled “Will GPT-4 Run DOOM?”. In short, GPT-4 cannot run DOOM, because the large language model is unable to execute DOOM’s source code directly. However, its multimodal variant, GPT-4V (GPT-4 with Vision), a large language model designed to take in images and answer questions about them, was capable of acting as a “proxy for the engine”.

To achieve this, Wynter designed a process in which a Manager captures screenshots from the game engine and sends them to GPT-4V, which returns natural-language descriptions of those images. The Manager then passes those descriptions to GPT-4, which replies with decisions based on the history it has been sent. Wynter combined that process with an Agent model that translates GPT-4’s responses into keystroke commands, which are then entered into the game.
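The loop described above can be sketched roughly as follows. This is a minimal illustration rather than Wynter's actual code: the function names, the key mapping, and the history size are all assumptions, and the calls to GPT-4V and GPT-4 are replaced by stand-in functions so the example runs without any API access. The Agent, which the article describes as a model, is reduced here to a simple lookup table.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical mapping from action names to key presses (not from the paper).
KEYMAP = {
    "forward": "w",
    "turn_left": "left",
    "turn_right": "right",
    "fire": "ctrl",
    "open_door": "space",
}


def agent_to_keys(decision: str) -> List[str]:
    """Agent step: translate a text decision into known key presses.

    A lookup table stands in for the Agent model; unknown words are ignored.
    """
    words = decision.lower().replace(",", " ").split()
    return [KEYMAP[w] for w in words if w in KEYMAP]


def run_loop(
    capture: Callable[[], bytes],          # Manager: grab a screenshot
    describe: Callable[[bytes], str],      # stand-in for the GPT-4V call
    decide: Callable[[List[str]], str],    # stand-in for the GPT-4 call
    press: Callable[[str], None],          # sends one key to the game
    steps: int,
    history_size: int = 5,                 # assumed value, for illustration
) -> List[str]:
    """Run the capture -> describe -> decide -> act cycle for `steps` frames."""
    history: Deque[str] = deque(maxlen=history_size)
    pressed: List[str] = []
    for _ in range(steps):
        frame = capture()
        history.append(describe(frame))    # image -> natural-language text
        decision = decide(list(history))   # decision based on recent history
        for key in agent_to_keys(decision):
            press(key)
            pressed.append(key)
    return pressed


# Stand-in components so the sketch runs offline.
def fake_capture() -> bytes:
    return b"frame"


def fake_describe(frame: bytes) -> str:
    return "a zombie is directly ahead"


def fake_decide(history: List[str]) -> str:
    return "fire" if "zombie" in history[-1] else "forward"


if __name__ == "__main__":
    print(run_loop(fake_capture, fake_describe, fake_decide, print, steps=2))
```

Because the decision step only ever sees a short window of text descriptions, anything that drops out of the screenshots also drops out of what the decision step can reason about.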


“GPT-4 can play Doom to a passable degree. Now, ‘passable’ here means ‘it actually does what it is supposed to, but kinda fails at the game and my grandma has played it wayyy better than this model’. What is interesting is that indeed, more complex call schemes (e.g., having a ‘planner’ generate short versions of what to do next; or polling experts) do lead to better results,” writes Wynter.

The result is an AI-powered bot capable of opening doors, fighting enemies, firing weapons and even carrying out instructions that can improve its own performance. It isn’t without its problems, though: the model appears to lack object permanence, the ability to remember where objects, and in particular enemies, are located. Once an in-game zombie goes off-screen, the AI immediately forgets about it and moves on with the rest of the game.



More specifically, if GPT-4 forgets about a zombie, it continues on with the game and eventually gets stuck in a corner, taking hits from the zombie it forgot about. This happens even though it has been instructed what to do when it is taking damage and cannot see the enemy causing it. Across the roughly 50 to 60 runs Wynter conducted, he saw it turn around and deal with the enemy only twice.

Even though the AI model plays the game poorly, Wynter says he still considers it incredible that GPT-4 can achieve this level of interaction with DOOM without any prior training. At the same time, he says it is as troubling as it is impressive: the entire process was quite easy to construct, and the AI model simply doesn’t second-guess any of its instructions.

Wynter says his process could have applications in automated video game testing, but warns that the model is clearly unaware of what it’s doing. Because getting a game extensively playtested is a very common problem in game development, he writes, a large language model could automate at least some of that work.

“On the ethics department, it is quite worrisome how easy it was for (a) me to build code to get the model to shoot something; and (b) for the model to accurately shoot something without actually second-guessing the instructions,” he wrote in his summary post.

“So, while this is a very interesting exploration around planning and reasoning, and could have applications in automated video game testing, it is quite obvious that this model is not aware of what it is doing. I strongly urge everyone to think about what deployment of these models [implies] for society and their potential misuse.”

