Apple, in collaboration with the University of California, Santa Barbara, has introduced MLLM-Guided Image Editing (MGIE) as an AI image editing tool. Users can describe desired edits in plain language without directly interacting with photo editing software. This model allows users to crop, resize, flip, or apply filters through text commands, accommodating both simple and complex editing tasks.
Why is this important?
Instruction-based image editing makes image manipulation more controllable and flexible: users issue natural language commands instead of writing detailed descriptions or drawing regional masks. However, human instructions are often too brief for existing methods to capture and follow. Multimodal large language models (MLLMs), which show promise in cross-modal understanding and visual-aware response generation, offer a way to fill in what a short command leaves unsaid.
To address this gap, the team developed MGIE, a model that learns to derive expressive instructions from a brief command and uses them as explicit guidance for the edit. The editing model takes in this visual imagination and performs the manipulation, and the two parts are trained together end to end.
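The two-stage idea can be illustrated with a toy sketch. This is not the MGIE implementation: the names `derive_expressive_instruction`, `apply_edit`, and `edit` are hypothetical, a keyword rule stands in for the MLLM, and simple pixel arithmetic on a small grayscale grid stands in for the diffusion-based editor. It only shows how a terse command might be turned into explicit, image-aware guidance before the edit is applied.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # toy grayscale image, pixel values 0-255


@dataclass
class ExpressiveInstruction:
    """Explicit, image-aware guidance derived from a terse user command."""
    operation: str
    amount: int
    text: str


def derive_expressive_instruction(command: str, image: Image) -> ExpressiveInstruction:
    """Hypothetical stand-in for the MLLM: inspect the image and spell out the edit."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    lowered = command.lower()
    if "brighter" in lowered:
        # The guidance depends on the input image: darker images get a larger boost.
        amount = 80 if mean < 100 else 30
        return ExpressiveInstruction(
            "brighten", amount, f"Raise the brightness of every pixel by {amount}."
        )
    if "flip" in lowered:
        return ExpressiveInstruction("flip", 0, "Mirror the image left to right.")
    raise ValueError(f"unsupported command: {command!r}")


def apply_edit(image: Image, guidance: ExpressiveInstruction) -> Image:
    """Hypothetical stand-in for the editing model: follow the explicit guidance."""
    if guidance.operation == "brighten":
        return [[min(255, p + guidance.amount) for p in row] for row in image]
    if guidance.operation == "flip":
        return [row[::-1] for row in image]
    raise ValueError(f"unsupported operation: {guidance.operation!r}")


def edit(command: str, image: Image) -> Image:
    """Run both stages: terse command -> expressive instruction -> edited image."""
    return apply_edit(image, derive_expressive_instruction(command, image))
```

In this sketch, the same command "make it brighter" yields a stronger adjustment for a dark image than for a bright one, which is the kind of image-dependent guidance a brief instruction cannot express by itself. In the real system both stages are learned models trained jointly rather than hand-written rules.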
The experiments cover various aspects of image editing, including Photoshop-style modification, global photo optimization, and local editing. The results demonstrate that expressive instructions are crucial for instruction-based image editing, and MGIE outperforms existing methods in terms of both automatic metrics and human evaluation.
Visual design tools often require prior knowledge to operate, which is why text-guided image editing has gained popularity for its controllability and accessibility. The authors note that diffusion models, which can model realistic images, have been adopted for image editing tasks. However, they highlight two limitations of existing methods: human instructions are often ambiguous or insufficient, and pre-trained text encoders such as CLIP are geared toward static image descriptions rather than descriptions of how an image should change.
Inspired by the capabilities of MLLMs, the researchers propose MGIE to address these limitations and improve instruction-based image editing. They emphasize the importance of visual-aware expressive instructions derived from MLLMs, which they show make image editing more controllable and practical. The paper gives a detailed overview of the method, including how the MLLM is incorporated, how concise expressive instructions are generated, and how the system is trained end to end.