ModelMicrosoftMicrosoftpublished May 7, 2026seen 5d

microsoft/Phi-Ground-Any

published May 7, 2026 · license: MIT · downloads: 1.4k · likes: 16

Microsoft Phi-Ground-Any-4B

🤖 HomePage | 📄 Paper | 📄 Arxiv | 😊 Model | 😊 Eval data

![overview](docs/images/intro.png)

Phi-Ground-Any-4B is a member of the Phi-Ground model family, fine-tuned from microsoft/Phi-3.5-vision-instruct with a fixed input resolution of 1680x1008.

Main results

![overview](docs/images/r1.png)

Usage

The current transformers version can be verified with: pip list | grep transformers.

Examples of required packages:

flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
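
The pins above can also be checked from Python rather than with pip list. The following is a minimal sketch using only the standard library; the helper name check_versions is hypothetical, and it assumes the listed names resolve as installed distribution names in your environment (for example, flash_attn may be registered as flash-attn). Local build suffixes such as "+cu121" are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the package list above.
PINNED = {
    "flash_attn": "2.5.8",
    "numpy": "1.24.4",
    "Pillow": "10.3.0",
    "requests": "2.31.0",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.43.0",
    "accelerate": "0.30.0",
}


def check_versions(pins):
    """Return {name: (installed_or_None, listed)} for every package that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            # Drop a local build suffix such as "+cu121" before comparing.
            found = version(name).split("+")[0]
        except PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (found, wanted) in check_versions(PINNED).items():
        print(f"{name}: installed={found}, listed={wanted}")
```

An empty result means every listed package is installed at the listed version. Since the list is described as examples of required packages, a mismatch is a prompt to investigate, not necessarily an error.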

Input Formats

The model requires a strict input format: a fixed image resolution, instruction-first ordering, and a system prompt.

Input preprocessing

from PIL import Image

def process_image(img):
    # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
    target_width, target_height = 336 * 5, 336 * 3

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    # Scale to fit inside the canvas while preserving the aspect ratio.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Pad with white; the resized image is anchored at the top-left corner.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img, reshape_ratio

# Phi-Ground-Anything takes the user instruction directly (no "describe the
# element" wrapper) and is trained to emit the click point as
# VALUEVALUE
# where VALUE is a relative coordinate in [0, 10000] over the padded canvas
# (i.e., divide by 10000 and multiply by target_width / target_height to get
# pixel coords in the padded image, then divide by reshape_ratio to recover
# coords in the ORIGINAL image).
instruction = ""
prompt = """
{instruction}

""".format(instruction=instruction)

image_path = ""
original_image = Image.open(image_path).convert("RGB")
image, reshape_ratio = process_image(original_image)

# ---------------------------------------------------------------------------
# Example: parse the model output and recover original-image coordinates.
# ---------------------------------------------------------------------------
import re

target_width, target_height = 336 * 5, 336 * 3
SCALE = 10000.0

x_pattern = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*")
y_pattern = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*")

def parse_xy(model_output: str):
    xs = [float(v) for v in x_pattern.findall(model_output)]
    ys = [float(v) for v in y_pattern.findall(model_output)]
    return list(zip(xs, ys))

def to_original_pixel(rel_xy, reshape_ratio: float):
    x_rel, y_rel = rel_xy
    px = (x_rel / SCALE) * target_width / reshape_ratio
    py = (y_rel / SCALE) * target_height / reshape_ratio
    return px, py

# model_output = "48233120"
# point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
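
Because the resized image is pasted at the top-left of the 1680 x 1008 canvas and the remainder is white padding, a predicted point that lands in the padding maps to a pixel outside the original image. The sketch below works through the arithmetic and adds a bounds check. The helper name to_original_pixel_checked and the clamping behaviour are illustrative assumptions, not part of the model card.

```python
TARGET_W, TARGET_H = 336 * 5, 336 * 3  # 1680 x 1008 padded canvas
SCALE = 10000.0                        # relative coordinates lie in [0, 10000]


def to_original_pixel_checked(rel_xy, reshape_ratio, orig_size):
    """Map a relative point to original-image pixels.

    Returns ((x, y), inside) where `inside` is False when the point fell in
    the padded region; in that case (x, y) is clamped to the image bounds.
    Hypothetical helper: the clamping policy is an assumption.
    """
    x_rel, y_rel = rel_xy
    orig_w, orig_h = orig_size

    # Relative -> padded-canvas pixels -> original-image pixels.
    px = (x_rel / SCALE) * TARGET_W / reshape_ratio
    py = (y_rel / SCALE) * TARGET_H / reshape_ratio

    inside = 0 <= px < orig_w and 0 <= py < orig_h
    cx = min(max(px, 0.0), orig_w - 1)
    cy = min(max(py, 0.0), orig_h - 1)
    return (cx, cy), inside


if __name__ == "__main__":
    # A 3360 x 1008 screenshot is wider than the canvas, so it is scaled by
    # 1680 / 3360 = 0.5 and occupies only the top half of the canvas.
    print(to_original_pixel_checked((5000, 2500), 0.5, (3360, 1008)))
    print(to_original_pixel_checked((5000, 7500), 0.5, (3360, 1008)))
```

In the second call the relative y of 7500 corresponds to canvas row 756, below the 504 rows the resized screenshot occupies, so the point is flagged as outside. Whether to clamp, reject, or retry in that case is an application decision.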

You can then run inference with either the Hugging Face model or vLLM.
