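Recently, Kunlun Tech (昆仑万维), together with the Beijing Academy of Artificial Intelligence (BAAI), Nanyang Technological University in Singapore, Peking University, and other leading universities and institutions, jointly proposed Cradle, which they describe as the first AI framework to date that can both play multiple commercial games and operate a variety of software applications. With this new general computer control framework, an AI agent can control the keyboard and mouse directly, as a person would, without any training and without relying on any internal API, so it can interact with any open-source or closed-source software.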
https://mp.weixin.qq.com/s/KYci5XDYS7hGHWYj7PwGhQ
https://baai-agents.github.io/Cradle
The Cradle framework empowers nascent foundation models to perform complex computer tasks via the same general interface humans use: screen as input and keyboard & mouse operations as output.
General Computer Control (GCC): building foundation agents that can master any computer task via the universal human-style interface, receiving input from screens and audio and outputting keyboard and mouse actions.
To pursue GCC, we propose Cradle, a modular and flexible LMM-powered framework that can properly handle the challenges GCC presents. The framework should have the ability to understand and interpret computer screens and dynamic changes between consecutive frames from arbitrary software and be able to generate reasonable computer control actions to be executed precisely. This suggests that a multimodal model with powerful vision and reasoning capabilities, in addition to rich knowledge of computer UI and control, is a requirement. In this work, we leverage GPT-4o as the framework’s backbone model.
Cradle is composed of six key modules: 1) information gathering to process multimodal input, 2) self-reflection to rethink past experiences, 3) task inference for choosing the best next task, 4) skill curation for generating and updating relevant skills for a given task, 5) action planning for deciding on specific executable actions for keyboard and mouse control, and 6) memory for storage and retrieval of past experiences and known skills.
In short, Cradle consists of six modules: information gathering, self-reflection, task inference, skill curation, action planning, and memory.
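To make the division of labor concrete, here is a minimal sketch of one agent step that calls six stub modules in the order they are listed above. Everything in it is an assumption for illustration: the `AgentState` fields, the function names, the stub logic, and the strictly sequential order are hypothetical and are not taken from the Cradle codebase, where each module is backed by an LMM.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentState:
    """Shared state passed between modules (hypothetical structure)."""
    screenshot: str = ""
    observations: List[str] = field(default_factory=list)
    reflection: str = ""
    task: str = ""
    skills: Dict[str, str] = field(default_factory=dict)
    actions: List[str] = field(default_factory=list)
    memory: List[dict] = field(default_factory=list)


def gather_information(state: AgentState) -> None:
    # 1) Turn the raw multimodal input into text the later modules can use.
    state.observations = [f"text seen in {state.screenshot}"]


def self_reflect(state: AgentState) -> None:
    # 2) Look back at what the previous step did (read from memory).
    last = state.memory[-1]["actions"] if state.memory else []
    state.reflection = f"reviewed {last}" if last else "no previous action"


def infer_task(state: AgentState) -> None:
    # 3) Pick the next task; this stub keeps the current one if it is set.
    state.task = state.task or "open settings menu"


def curate_skills(state: AgentState) -> None:
    # 4) Add or update skills that are relevant to the task.
    state.skills.setdefault("open_menu", "press the ESC key")


def plan_actions(state: AgentState) -> None:
    # 5) Choose concrete keyboard and mouse actions.
    state.actions = ["key_press:esc"]


def update_memory(state: AgentState) -> None:
    # 6) Store the experience so later steps can retrieve it.
    state.memory.append({
        "task": state.task,
        "actions": list(state.actions),
        "reflection": state.reflection,
    })


PIPELINE = [gather_information, self_reflect, infer_task,
            curate_skills, plan_actions, update_memory]


def run_step(state: AgentState, screenshot: str) -> AgentState:
    """Run every module once on a new screenshot."""
    state.screenshot = screenshot
    for module in PIPELINE:
        module(state)
    return state
```

Calling `run_step` twice on the same state shows the role of memory: the second step's reflection is built from the actions the first step stored.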
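Cradle's high degree of generality comes from the way it encapsulates and abstracts the raw input and output of interacting with a computer. It takes the video frames shown on screen as input, extracts their textual and visual information to make decisions, and outputs the lowest-level operating-system signals that control the keyboard and mouse. As a result, it can interact with software without relying on any assumptions or any internal API.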
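The abstraction described here can be sketched as a two-method interface: one call to read the screen, one call to send an input event. The names below (`Computer`, `FakeComputer`, `KeyPress`, `MouseClick`, `decide`) are hypothetical, and the screen is represented as plain text rather than pixels to keep the example self-contained. A real backend would capture frames and emit OS-level input events, and the decision would come from an LMM rather than a string check.

```python
from dataclasses import dataclass
from typing import List, Protocol, Union


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseClick:
    x: int
    y: int
    button: str = "left"


Action = Union[KeyPress, MouseClick]


class Computer(Protocol):
    """The only surface the agent touches: screen in, input events out."""

    def capture_screen(self) -> str: ...

    def send(self, action: Action) -> None: ...


class FakeComputer:
    """In-memory stand-in used here instead of a real OS backend."""

    def __init__(self, screen_text: str) -> None:
        self.screen_text = screen_text
        self.sent: List[Action] = []

    def capture_screen(self) -> str:
        return self.screen_text

    def send(self, action: Action) -> None:
        self.sent.append(action)


def decide(screen: str) -> List[Action]:
    # Toy policy: the decision depends only on what is visible on screen.
    if "Press ENTER" in screen:
        return [KeyPress("enter")]
    return [MouseClick(100, 200)]


def step(computer: Computer) -> None:
    """One interaction: read the screen, then emit keyboard/mouse actions."""
    for action in decide(computer.capture_screen()):
        computer.send(action)
```

Because `step` depends only on `capture_screen` and `send`, the same agent code could in principle be pointed at any application, which is the property the paragraph above attributes to Cradle.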
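Cradle's decision and reasoning modules also let it interact with software and complete tasks on its own initiative. The process can be summarized simply as: reflect on the past, summarize the present, plan the future.
Reflect on the past: Cradle takes the video recorded while the previous action was executing as input, extracts the key textual and visual information from it, and reflects to judge whether the last action succeeded, whether the task is complete, and how to improve.
Summarize the present: after reflecting, Cradle summarizes the current situation and uses that summary to decide whether to switch the task goal or modify the task.
Plan the future: finally, Cradle generates or updates its skills based on the current task and situation, retrieves from the skills it has already learned a subset relevant to the current task as candidates, and then selects suitable skills from them and instantiates them as actions to execute.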
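The three phases can be sketched as three small functions. This is a toy illustration under stated assumptions, not Cradle's implementation: the function names, the map example, the keyword checks, and the word-overlap scoring used for skill retrieval are all hypothetical stand-ins for what the framework delegates to an LMM.

```python
from typing import Dict, List, Optional


def reflect(last_action: Optional[str], frames_text: List[str]) -> dict:
    """Judge the previous action from text extracted from its video frames."""
    if last_action is None:
        return {"success": True, "advice": ""}
    success = any("opened" in text.lower() for text in frames_text)
    advice = "" if success else f"retry or replace {last_action}"
    return {"success": success, "advice": advice}


def summarize(task: str, reflection: dict, screen_text: str) -> str:
    """Keep the current task, or switch once the screen shows it is done."""
    done = reflection["success"] and "map opened" in screen_text.lower()
    if done and task == "open the map":
        return "set a waypoint on the map"
    return task


def retrieve_skills(task: str, skills: Dict[str, str], k: int = 2) -> List[str]:
    """Return up to k skill names whose descriptions share words with the task."""
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(description.lower().split())), name)
        for name, description in skills.items()
    ]
    relevant = sorted(
        (item for item in scored if item[0] > 0),
        key=lambda item: (-item[0], item[1]),
    )
    return [name for _, name in relevant[:k]]


def plan(task: str, skills: Dict[str, str]) -> Optional[str]:
    """Pick the best candidate skill and instantiate it as an action string."""
    candidates = retrieve_skills(task, skills)
    return f"{candidates[0]}()" if candidates else None


SKILLS = {
    "open_map": "open the map screen",
    "set_waypoint": "set a waypoint marker on the map",
    "mount_horse": "mount the nearest horse",
}
```

A usage example: after an `open_map()` action whose recorded frames contain "Map opened", `reflect` reports success, `summarize` moves on to the waypoint task, and `plan` selects `set_waypoint()` from the retrieved candidates.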