Thinking with images
Captured source
source ↗Thinking with images | OpenAI
April 16, 2025
Thinking with images
Loading…
Share
OpenAI o3 and o4-mini are the latest visual reasoning models in our o-series. For the first time, our models can think with images in their chain-of-thought—not just see them.
Similar to our earlier OpenAI o1 model, o3 and o4-mini are trained to think for longer before answering—and use a long internal chain of thought before responding to the user. o3 and o4-mini further extend this capability by thinking with images in their chain-of-thought, which is achieved by transforming user uploaded images with tools, allowing them to crop, zoom in, and rotate, in addition to other simple image processing techniques. More importantly, these capabilities come natively, without relying on separate specialized models.
ChatGPT’s enhanced visual intelligence helps you solve tougher problems by analyzing images more thoroughly, accurately, and reliably than ever before. It can seamlessly combine advanced reasoning with tools like web search and image manipulation—automatically zooming, cropping, flipping, or enhancing your images—to extract insights even from imperfect photos. For example, you can upload a photo of an economics problem set to receive step-by-step explanations, or share a screenshot of a build error to quickly get a root-cause analysis.
This approach enables a new axis for test-time compute scaling that seamlessly blends visual and textual reasoning, as reflected in their state-of-the-art performance across multimodal benchmarks, marking a significant step toward multimodal reasoning.
Visual reasoning in action
Thinking with images allows you to interact with ChatGPT more easily. You can ask questions by taking a photo without worrying about the positioning of objects—whether the text is upside down or there are multiple physics problems in one photo. Even if objects are not obvious at first glance, visual reasoning allows the model to zoom in to see more clearly.
All examples were completed with OpenAI o3.
What is written on the notebook?
Reasoned about notebook contents for 20 seconds
Analyzed image
Analyzed image
- The user posted an image at /mnt/data/own_notebook.jpg and asked me to read what's written on it. I'll need to load the image so I can inspect the text. Once I view it, I realize the text is upside down, so I’ll rotate it so it's readable.From there, I can check what's written and share my findings clearly with the user.
Analyzed image
- The user shared an image found at /mnt/data/own_notebook.jpg and asked me to read the text on it. I’ll load the image and zoom in to see the writing clearly. It turns out the text is upside down, so I’ll rotate it for easy reading. Afterward, I’ll crop the lower part of the image to focus on the relevant area, and rotate the cropped section to ensure the text is right side up.
Analyzed image
Done
It says: “4th February – finish roadmap.”
Our latest visual reasoning models work in tandem with other tools like Python data analysis, web search, image generation to creatively and effectively solve more complex problems, delivering our first multimodal agentic experience to users.
solve this maze and plot a path with red-colored line
Reasoned about maze solution and path for 1m 44s
Analyzed image
- The user provided an image path, but I’ve yet to load it. First, I’ll need to load the image from the given path for analysis. Once that’s done, I can detect the maze path, potentially marked by black lines, and plot a red line to showcase it. The image is relatively small—404x
- I’m thinking the start could be at the top left missing border piece, or maybe near the left border. To help figure this out, I’ll look at the first row pixels to find any white areas near the border. Since the maze walls are black on white, I’ll try a thresholding approach. To invert it, I’ll use a threshold value of 200, which would set white to 255 and black closer to 0. I’ll apply this threshold using the inverse method.
- Okay, I initially thought I could approximate the maze start near the top-left opening. I applied some thresholding to detect white pixels but encountered an issue—nothing changed. It seems the image might have high noise. Oddly, when inspecting the grayscale, all pixel values are zero, which suggests the image is completely black. This doesn’t match what we saw earlier with Matplotlib, so I’ll delve deeper into how PIL loaded the image and inspect the underlying pixel array.
Analyzed
Unknown component type: componentCodeExample
Analyzed
Unknown component type: componentCodeExample
It looks like the lines are black with a transparent background, but when PIL loads the image, it's showing a white background due to how it's handling the alpha channel. I’ll try converting the image to RGB, but the pixel values are still showing all zeros. This is odd, since the black lines should be visible. Hmm, earlier the display had a white background. Let me check the pixel values at coordinate (200, 200) to get a clearer picture.
Analyzed
Unknown component type: componentCodeExample
Analyzed
Unknown component type: componentCodeExample
- The image has black lines with a transparent background, but it appears white due to how PIL handled the alpha channel. When displayed earlier, I saw white areas, which were actually transparent areas shown on a white grid (the Matplotlib background color). So, the lines are black with the transparent spaces representing the open path. I'll need to treat these transparent areas as the open path when interpreting the maze. It’s essentially a black line maze on a clear canvas.
- I see that the maze walls are black pixels, and the path is transparent with an alpha value of 0. I need to find the path from an open cell on one border to another on the opposite border. To do this, I'll create a binary matrix where the passable cells have an alpha of 0 and are not part of a line border. Now, I'll focus on processing the image channels to identify the path.
Analyzed
Unknown component type: componentCodeExample
- The walls are black, and the path is transparent (alpha 0). I need to find a way from the open cell at one border to the open cell at the opposite border. I’ll create a binary matrix where the passable cells are transparent (alpha 0). However, there are a lot of transparent cells, and the paths might not be clear due to line gaps. I plan to treat the lines as about 3 pixels thick. I'll try BFS/DFS…
Excerpt shown — open the source for the full document.
Notability
notability 7.0/10Substantive multimodal AI research post