{"id":420,"date":"2024-07-12T15:10:18","date_gmt":"2024-07-12T07:10:18","guid":{"rendered":"https:\/\/aitimes.link\/?p=420"},"modified":"2024-07-12T15:10:18","modified_gmt":"2024-07-12T07:10:18","slug":"general-computer-control-cradle","status":"publish","type":"post","link":"https:\/\/aitimes.link\/index.php\/2024\/07\/12\/general-computer-control-cradle\/","title":{"rendered":"General Computer Control- Cradle"},"content":{"rendered":"\n<p>\u8fd1\u65e5\uff0c\u6606\u4ed1\u4e07\u7ef4\u643a\u624b\u5317\u4eac\u667a\u6e90\u4eba\u5de5\u667a\u80fd\u7814\u7a76\u9662\u3001\u65b0\u52a0\u5761\u5357\u6d0b\u7406\u5de5\u5927\u5b66\u3001\u5317\u4eac\u5927\u5b66\u7b49\u9876\u5c16\u540d\u6821\u673a\u6784\uff0c\u8054\u5408\u63d0\u51fa\u4e86<strong>\u8fc4\u4eca\u4e3a\u6b62\u7b2c\u4e00\u4e2a\u65e2\u80fd\u73a9\u591a\u79cd\u5546\u4e1a\u6e38\u620f\u53c8\u80fd\u64cd\u4f5c\u5404\u79cd\u8f6f\u4ef6\u5e94\u7528\u7684AI\u6846\u67b6\u2014\u2014Cradle\u3002<\/strong>\u5728\u8fd9\u4e2a\u5168\u65b0\u7684\u901a\u7528\u8ba1\u7b97\u673a\u63a7\u5236\u6846\u67b6\u52a0\u6301\u4e0b<strong>\uff0cAI Agent\u65e0\u9700\u8bad\u7ec3\u4fbf\u80fd\u50cf\u4eba\u4e00\u6837\u76f4\u63a5\u63a7\u5236\u952e\u76d8\u9f20\u6807\uff0c\u4e0d\u4f9d\u8d56\u4efb\u4f55\u5185\u90e8API\uff0c\u5b9e\u73b0\u4efb\u610f\u5f00\u95ed\u6e90\u8f6f\u4ef6\u4ea4\u4e92<\/strong>\u3002<br><a href=\"https:\/\/mp.weixin.qq.com\/s\/KYci5XDYS7hGHWYj7PwGhQ\">https:\/\/mp.weixin.qq.com\/s\/KYci5XDYS7hGHWYj7PwGhQ<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/baai-agents.github.io\/Cradle\">https:\/\/baai-agents.github.io\/Cradle<\/a><br>The Cradle framework empowers nascent foundation models to perform complex computer tasks via the same general interface humans use: screen as input and keyboard &amp; mouse operations as output.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"334\" src=\"https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1024x334.png\" alt=\"\" class=\"wp-image-421\" srcset=\"https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1024x334.png 1024w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-300x98.png 300w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-768x250.png 768w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1536x501.png 1536w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-2048x668.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Building&nbsp;<strong>foundation agents<\/strong>&nbsp;that can master&nbsp;<strong>ANY<\/strong>&nbsp;computer task via the universal human-style interface by receiving input from screens and audio and outputting keyboard and mouse actions.<\/p>\n<\/blockquote>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/baai-agents.github.io\/Cradle\/images\/Path_towards_AGI.png\" alt=\"\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"515\" src=\"https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1-1024x515.png\" alt=\"\" class=\"wp-image-432\" srcset=\"https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1-1024x515.png 1024w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1-300x151.png 300w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1-768x386.png 768w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1-1536x772.png 1536w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-1-2048x1030.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>To pursue GCC, we propose&nbsp;<strong>Cradle<\/strong>, a modular and flexible LMM-powered framework that can properly handle the challenges GCC presents. The framework should have the ability to understand and interpret computer screens and dynamic changes between consecutive frames from arbitrary software and be able to generate reasonable computer control actions to be executed precisely. This suggests that a multimodal model with powerful vision and reasoning capabilities, in addition to rich knowledge of computer UI and control, is a requirement. In this work, we leverage GPT-4o as the framework&#8217;s backbone model.<\/p>\n\n\n\n<p><strong>Cradle<\/strong>\u00a0is composed of six key modules: 1)\u00a0<strong>information gathering<\/strong>\u00a0to process multimodal input, 2)\u00a0<strong>self-reflection<\/strong>\u00a0to rethink past experiences, 3)\u00a0<strong>task inference<\/strong>\u00a0for choosing the best next task, 4)\u00a0<strong>skill curation<\/strong>\u00a0for generating and updating relevant skills for a given task, 5)\u00a0<strong>action planning<\/strong>\u00a0for deciding on specific executable actions for keyboard and mouse control, and 6)\u00a0<strong>memory<\/strong>\u00a0for storage and retrieval of past experiences and known skills.<\/p>\n\n\n\n<p>Cradle\u4e00\u5171\u75316\u4e2a\u6a21\u5757\u7ec4\u6210\uff1a<strong>\u4fe1\u606f\u6536\u96c6\u3001\u81ea\u6211\u53cd\u601d\u3001\u4efb\u52a1\u63a8\u65ad\u3001\u6280\u80fd\u7ba1\u7406\u3001\u884c\u52a8\u89c4\u5212<\/strong>\uff0c\u4ee5\u53ca<strong>\u8bb0\u5fc6<\/strong>\u6a21\u5757\u3002<\/p>\n\n\n\n<p>Cradle\u9ad8\u5ea6\u7684\u901a\u7528\u6027\uff0c\u6765\u6e90\u4e8e\u5176\u5bf9\u548c\u7535\u8111\u4ea4\u4e92\u8fc7\u7a0b\u4e2d\u7684\u539f\u59cb\u8f93\u5165\u8f93\u51fa\u7684\u5408\u7406\u5c01\u88c5\u548c\u62bd\u8c61\u3002\u4ee5\u4ece\u5c4f\u5e55\u4e2d\u663e\u793a\u7684\u89c6\u9891\u56fe\u50cf\u4f5c\u4e3a\u8f93\u5165\uff0c\u63d0\u53d6\u5176\u4e2d\u7684\u6587\u672c\u548c\u89c6\u89c9\u4fe1\u606f\u8fdb\u884c\u51b3\u7b56\uff0c\u5e76\u4e14\u8f93\u51fa\u6700\u5e95\u5c42\u7684\u64cd\u4f5c\u7cfb\u7edf\u4e2d\u63a7\u5236\u952e\u76d8\u548c\u9f20\u6807\u7684\u4fe1\u53f7\u53bb\u548c\u7535\u8111\u4ea4\u4e92\uff0c\u4f7f\u5176\u53ef\u4ee5\u4e0d\u4f9d\u8d56\u4e8e\u4efb\u4f55\u5047\u8bbe\u548c\u4efb\u4f55\u5185\u90e8API\u8fdb\u884c\u4ea4\u4e92\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"556\" src=\"https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-2-1024x556.png\" alt=\"\" class=\"wp-image-433\" srcset=\"https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-2-1024x556.png 1024w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-2-300x163.png 300w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-2-768x417.png 768w, https:\/\/aitimes.link\/wp-content\/uploads\/2024\/07\/image-2.png 1292w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>\u540c\u65f6\uff0cCradle\u5f3a\u5927\u7684\u51b3\u7b56\u63a8\u7406\u6a21\u5757\u8ba9\u5176\u5f97\u4ee5\u81ea\u53d1\u548c\u8f6f\u4ef6\u8fdb\u884c\u4ea4\u4e92\u5e76\u4e14\u5b8c\u6210\u4efb\u52a1\uff0c\u8fd9\u4e2a\u8fc7\u7a0b\u53ef\u4ee5\u88ab\u7b80\u5355\u5730\u603b\u7ed3\u4e3a\uff1a<strong>\u53cd\u601d\u8fc7\u53bb\uff0c\u603b\u7ed3\u73b0\u5728\uff0c\u89c4\u5212\u672a\u6765<\/strong>\u3002<\/p>\n\n\n\n<p><strong>\u53cd\u601d\u8fc7\u53bb\uff1a<\/strong>Cradle\u4f7f\u7528\u6267\u884c\u8fc7\u5f80\u52a8\u4f5c\u8fc7\u7a0b\u7684\u89c6\u9891\u4f5c\u4e3a\u8f93\u5165\uff0c\u5206\u522b\u63d0\u53d6\u51fa\u5176\u4e2d\u5173\u952e\u7684\u6587\u672c\u548c\u89c6\u89c9\u4fe1\u606f\uff0c\u901a\u8fc7\u53cd\u601d\u6765\u5224\u65ad\u4e0a\u4e00\u6b65\u52a8\u4f5c\u662f\u5426\u6267\u884c\u6210\u529f\u4efb\u52a1\u662f\u5426\u5b8c\u6210\u4ee5\u53ca\u5982\u4f55\u6539\u8fdb\u3002<\/p>\n\n\n\n<p><strong>\u603b\u7ed3\u73b0\u5728\uff1a<\/strong>\u53cd\u601d\u5b8c\u4e4b\u540e\uff0cCradle\u9700\u8981\u603b\u7ed3\u5f53\u524d\u60c5\u51b5\uff0c\u5e76\u4e14\u4ee5\u6b64\u4e3a\u6839\u636e\u6765\u51b3\u5b9a\u662f\u5426\u66f4\u6362\u4efb\u52a1\u76ee\u6807\u6216\u662f\u4fee\u6539\u4efb\u52a1\u5185\u5bb9\u3002<\/p>\n\n\n\n<p><strong>\u89c4\u5212\u672a\u6765\uff1a<\/strong>\u6700\u540eCradle\u4f1a\u6839\u636e\u5f53\u524d\u4efb\u52a1\u4ee5\u53ca\u73b0\u72b6\u751f\u6210\u6216\u8005\u66f4\u65b0\u81ea\u8eab\u7684\u6280\u80fd\uff0c\u5e76\u4e14\u4ece\u5df2\u7ecf\u5b66\u4f1a\u7684\u6280\u80fd\u4e2d\u68c0\u7d22\u4e00\u90e8\u5206\u548c\u5f53\u524d\u4efb\u52a1\u76f8\u5173\u7684\u6280\u80fd\u4f5c\u4e3a\u5907\u9009\uff0c\u7136\u540e\u4ece\u4e2d\u9009\u53d6\u5408\u9002\u7684\u6280\u80fd\u5b9e\u4f8b\u5316\u4e3a\u52a8\u4f5c\u53bb\u6267\u884c\u3002<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/BAAI-Agents\/Cradle\">https:\/\/github.com\/BAAI-Agents\/Cradle<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>\u8fd1\u65e5\uff0c\u6606\u4ed1\u4e07\u7ef4\u643a\u624b\u5317\u4eac\u667a\u6e90\u4eba\u5de5\u667a\u80fd\u7814\u7a76\u9662\u3001\u65b0\u52a0\u5761\u5357\u6d0b\u7406\u5de5\u5927\u5b66\u3001\u5317\u4eac\u5927\u5b66\u7b49\u9876\u5c16\u540d\u6821\u673a\u6784\uff0c\u8054\u5408\u63d0\u51fa\u4e86\u8fc4\u4eca\u4e3a\u6b62\u7b2c\u4e00\u4e2a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-420","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/posts\/420","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/comments?post=420"}],"version-history":[{"count":1,"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/posts\/420\/revisions"}],"predecessor-version":[{"id":434,"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/posts\/420\/revisions\/434"}],"wp:attachment":[{"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/media?parent=420"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/categories?post=420"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aitimes.link\/index.php\/wp-json\/wp\/v2\/tags?post=420"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}