{"id":802,"date":"2010-07-27T13:28:03","date_gmt":"2010-07-27T04:28:03","guid":{"rendered":"http:\/\/www.cns.atr.jp\/bri\/?page_id=802"},"modified":"2018-11-02T14:33:31","modified_gmt":"2018-11-02T05:33:31","slug":"learningbiped-locomotion","status":"publish","type":"page","link":"https:\/\/bicr.atr.jp\/bri\/en\/research\/learningbiped-locomotion\/","title":{"rendered":"LearningBiped Locomotion"},"content":{"rendered":"<div id=\"researchT1\">\n<h3><span style=\"font-family: helvetica;\">Goal<\/span><\/h3>\n<p><span style=\"font-family: helvetica;\">The goal of our study is to understand the principle of human motor control and optimal control strategies. Such discrete movements as two-point reaching arm movements have been studied to understand the optimal control strategy of humans. On the other hand, such periodic behaviors as biped walking have not been intensively studied from an optimal control viewpoint. We focused on optimization and learning algorithms for periodic movements, especially for the development of biped walking optimization algorithms to understand human biped control mechanisms.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">To understand human biped walking strategies, we started to work on a central pattern generator model (CPG), which is a neural oscillator model of animals. The neural oscillator takes an important role for legged locomotion to synchronize the periodic patterns of body movements with environments. 
By assuming that humans use CPG-based synchronization mechanisms, we developed learning algorithms for biped locomotion based on an oscillator model to generate periodic leg behaviors.<\/span><\/p>\n<\/div>\n<p><!--researchT1--><\/p>\n<h3><span style=\"font-family: helvetica;\">Reinforcement learning algorithms for biped locomotion<\/span><\/h3>\n<p><span style=\"font-family: helvetica;\">We first tested our CPG model on a simple biped robot named DB-chan, which is constrained to the sagittal plane by a boom (see IMG1).<\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-family: helvetica;\"><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG15.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-932 aligncenter\" title=\"IMG1\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG15.jpg\" alt=\"\" width=\"447\" height=\"270\" srcset=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG15.jpg 745w, https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG15-300x181.jpg 300w\" sizes=\"auto, (max-width: 447px) 100vw, 447px\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">We then extended our CPG-based biped controller to a humanoid robot model, which has many more degrees of freedom than DB-chan. In general, solving a non-linear optimal control problem is difficult for a system with many degrees of freedom. The problem becomes especially complicated if we apply a policy optimization method to a real robot, since many learning iterations are required to optimize the policies and the hardware may be damaged during the learning trials. 
On the other hand, humans can learn proper control policies from a limited number of learning iterations.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">In our study, we identified environmental models using a non-parametric regression method from only a few learning iterations. We then optimized the policies on the identified non-parametric model and tested our approach on humanoid robots (see the movies).<\/span><\/p>\n<div style=\"width: 640px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-802-1\" width=\"640\" height=\"480\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/CBi_walk.x264.aac.mp4?_=1\" \/><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/CBi_walk.x264.aac.mp4\">https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/CBi_walk.x264.aac.mp4<\/a><\/video><\/div>\n<p><span style=\"font-family: helvetica;\">\u00a0<\/span><\/p>\n<div style=\"width: 640px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-802-2\" width=\"640\" height=\"480\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/CBi_turn_L.x264.aac.mp4?_=2\" \/><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/CBi_turn_L.x264.aac.mp4\">https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/CBi_turn_L.x264.aac.mp4<\/a><\/video><\/div>\n<p><span style=\"font-family: helvetica;\">\u00a0<\/span><\/p>\n<p><!--researchT2--><\/p>\n<h3><span style=\"font-family: helvetica;\">Learning phase modulation policies<\/span><\/h3>\n<p><span style=\"font-family: helvetica;\">In this study, we proposed a learning algorithm for phase modulation policies based on a policy gradient method. 
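The policy gradient machinery can be illustrated with a minimal REINFORCE-style sketch: a Gaussian policy over a scalar phase modulation, updated along the likelihood-ratio gradient. The quadratic stand-in reward and all parameter values here are illustrative assumptions, not the controller or reward used in this study.

```python
import random

def train_phase_policy(target=0.3, iters=3000, seed=0):
    """REINFORCE on a Gaussian policy pi(dphi) = N(mu, sigma^2).
    The stand-in reward favors phase modulations near `target`."""
    rng = random.Random(seed)
    mu, sigma, lr = 0.0, 0.2, 0.05
    tail = []
    for t in range(iters):
        dphi = rng.gauss(mu, sigma)             # sample a phase modulation
        reward = -(dphi - target) ** 2          # stand-in scalar reward
        grad_log_pi = (dphi - mu) / sigma ** 2  # d log pi / d mu
        mu += lr * reward * grad_log_pi         # likelihood-ratio update
        if t >= iters - 500:
            tail.append(mu)
    return sum(tail) / len(tail)                # averaged final policy mean
```

In expectation, the update ascends the gradient of the expected reward, so the policy mean drifts toward the rewarded phase modulation.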
The aim of the policy is to synchronize the biped robot with its environment.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">In our learning framework, each desired joint angle depends on the phase. The phase modulation policies are represented by probability distributions (see IMG2).<\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-family: helvetica;\"><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG21.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-937 aligncenter\" title=\"IMG2\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG21.jpg\" alt=\"\" width=\"480\" height=\"221\" srcset=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG21.jpg 800w, https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG21-300x138.jpg 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">Each phase modulation policy outputs an amount of phase modulation given the current phase. We applied our proposed method to the simulated DB-chan model.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">(See IMG3.) (Top) Without phase modulation policies. (Bottom) With an acquired phase modulation policy: 
The DB-chan model could walk without falling.<\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-family: helvetica;\"><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG3.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-938 aligncenter\" title=\"IMG3\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG3.jpg\" alt=\"\" width=\"446\" height=\"265\" srcset=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG3.jpg 744w, https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG3-300x178.jpg 300w\" sizes=\"auto, (max-width: 446px) 100vw, 446px\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">With the acquired phase modulation policy, the DB-chan model could walk without falling even when its swing leg was disturbed by a simulated obstacle on the ground. The figure shows the generated walking pattern (see IMG4). <\/span><\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG41.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-941 aligncenter\" title=\"IMG4\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG41.jpg\" alt=\"\" width=\"377\" height=\"150\" srcset=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG41.jpg 628w, https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG41-300x119.jpg 300w\" sizes=\"auto, (max-width: 377px) 100vw, 377px\" \/><\/a><\/p>\n<p><span style=\"font-family: helvetica;\">The DB-chan model could keep walking after the disturbance by modulating its phase according to the acquired policy.<\/span><\/p>\n<p><!--researchT3--><\/p>\n<h3><span style=\"font-family: helvetica;\">Poincar\u00e9-map based reinforcement learning for learning biped locomotion<\/span><\/h3>\n<p><span style=\"font-family: helvetica;\">In this study, we propose to learn locally stable biped walking policies by estimating the 
Poincar\u00e9 map through learning trials. We represent leg trajectories by a fifth-order spline function that interpolates via-points, which are modulated by an acquired walking policy.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">We optimize biped walking controllers based on an approximated Poincar\u00e9 map using a model-based reinforcement learning framework. The Poincar\u00e9 map represents the locus of the intersection of the biped trajectory with a hyperplane in the full state space.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">In our case, we are interested in the system state at two symmetric phase angles of the walking gait. Modulating the via-points affects the locus of intersection, and our learned model captures this effect (see IMG5).<\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-family: helvetica;\"><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG51.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-943 aligncenter\" title=\"IMG5\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG51.jpg\" alt=\"\" width=\"436\" height=\"329\" srcset=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG51.jpg 726w, https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG51-300x226.jpg 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">Since the policy output changes only at the Poincar\u00e9 section, our method can be viewed as a scheme for learning a policy that outputs a proper \u201coption\u201d in a Semi-Markov Decision Process (SMDP).<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">We derive the update rule for the policy using the value function and the estimated Poincar\u00e9 map.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">The update rule is the following:<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">1. Derive the gradient of the learned Poincar\u00e9 map model.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">2. Derive the gradient of the approximated value function.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">3. To update the policy parameters, compute the desired output as the inner product of the gradients of the value function and the Poincar\u00e9 map.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">The acquired value function and policy are presented in (IMG6).<\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-family: helvetica;\"><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG61.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-945 aligncenter\" title=\"IMG6\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG61.jpg\" alt=\"\" width=\"480\" height=\"198\" srcset=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG61.jpg 800w, https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2010\/07\/IMG61-300x123.jpg 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">We applied the proposed learning method to the real biped robot (DB-chan). 
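The three numbered steps can be sketched generically as follows. Here `f` is a stand-in for a learned one-dimensional Poincaré map (the section state one step ahead as a function of the current section state and the policy output), `V` is a stand-in for the approximated value function, and both gradients are taken by finite differences; none of these names or functional forms come from the actual implementation.

```python
def policy_update_direction(f, V, x, u, eps=1e-5):
    """Sketch of the three-step update rule:
    1. gradient of the (learned) Poincare map w.r.t. the policy output,
    2. gradient of the (approximated) value function at the mapped state,
    3. their inner product as the desired policy-output direction."""
    x_next = f(x, u)
    # 1. d f / d u, one finite difference per policy-output dimension
    df_du = []
    for i in range(len(u)):
        u_pert = list(u)
        u_pert[i] += eps
        df_du.append((f(x, u_pert) - x_next) / eps)
    # 2. d V / d x at the state one section crossing ahead
    dV_dx = (V(x_next + eps) - V(x_next - eps)) / (2 * eps)
    # 3. inner product: direction in u that increases the value one step ahead
    return [dV_dx * g for g in df_du]

# Stand-in examples: a linear section map and a quadratic value function.
f = lambda x, u: 0.5 * x + u[0] - 0.2 * u[1]
V = lambda x: -(x - 1.0) ** 2
direction = policy_update_direction(f, V, 0.0, [0.0, 0.0])
```

With these stand-ins, dV/dx at the mapped state is 2 and df/du is (1, -0.2), so the returned direction is approximately (2, -0.4); moving the policy output along it increases the value at the next section crossing.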
The movies below show walking behaviors before and after learning.<\/span><\/p>\n<p><span style=\"font-family: helvetica;\">\u00a0<\/span><\/p>\n<p>Before Learning<\/p>\n<div style=\"width: 720px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-802-3\" width=\"720\" height=\"480\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/before_learning_far_view.x264.aac.mp4?_=3\" \/><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/before_learning_far_view.x264.aac.mp4\">https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/before_learning_far_view.x264.aac.mp4<\/a><\/video><\/div>\n<p>After Learning<\/p>\n<div style=\"width: 720px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-802-4\" width=\"720\" height=\"480\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/after_learning_far_view_metal.x264.aac.mp4?_=4\" \/><a href=\"https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/after_learning_far_view_metal.x264.aac.mp4\">https:\/\/bicr.atr.jp\/bri\/wp-content\/uploads\/2018\/11\/after_learning_far_view_metal.x264.aac.mp4<\/a><\/video><\/div>\n<div id=\"researchT5\">\n<h3><span style=\"font-family: helvetica;\">References<\/span><\/h3>\n<p><span style=\"font-family: helvetica;\">[1] Morimoto, J., Atkeson, C. G., Endo, G., &amp; Cheng, G. Improving humanoid locomotive performance with learnt approximated dynamics via Gaussian processes for regression. In IEEE\/RSJ International Conference on Intelligent Robots and Systems, pp. 
4234-4240, San Diego, CA, USA (2007)<\/span><\/p>\n<\/div>\n<p><!--researchT5--><span style=\"font-family: helvetica;\"><br \/>\n<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Goal The goal of our study is to understand the principle of human motor control and optimal control strategies. Such discrete movements as two-point reaching arm movements have been studied to understand the optimal control strategy of humans. On the other hand, such periodic behaviors as biped walking have not been intensively studied from an optimal control viewpoint. We focused on optimization and learning algorithms for periodic movements, especially for the development of biped walking optimization algorithms to understand human biped control mechanisms. To understand human biped walking strategies, we started to work on a central pattern generator model (CPG), which is a neural oscillator model of animals. The neural [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":806,"menu_order":4,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-802","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/pages\/802","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/comments?post=802"}],"version-history":[{"count":17,"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/pages\/802\/revisions"}],"predecessor-version":[{"id":2526,"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/pages\/802\/revisions\/2526"}],"up":[{"embeddable":true,"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2
\/pages\/806"}],"wp:attachment":[{"href":"https:\/\/bicr.atr.jp\/bri\/en\/wp-json\/wp\/v2\/media?parent=802"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}