World Model Paper Review (1)

5/9/2025·임성혁

Updated: 5/18/2025

Paper title: World Models
Authors: David Ha, Jürgen Schmidhuber
Submitted: March 2018
Paper link: https://arxiv.org/abs/1803.10122
Online version: https://worldmodels.github.io/

RNN, VAE, MDN-RNN, CMA-ES

(To be written after finishing the review.)

We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.

Humans develop a mental model of the world based on what they are able to perceive with their limited senses. The decisions and actions we make are based on this internal model.

To handle the vast amount of information that flows through our daily lives, our brain learns an abstract representation of both spatial and temporal aspects of this information.

Evidence also suggests that what we perceive at any given moment is governed by our brain’s prediction of the future based on our internal model.

One way of understanding the predictive model inside of our brains is that it might not be about just predicting the future in general, but predicting future sensory data given our current motor actions.

We are able to instinctively act on this predictive model and perform fast reflexive behaviours when we face danger, without the need to consciously plan out a course of action.

(Abridged) The paper uses a baseball batter's behavior as an example.

In many reinforcement learning (RL) problems, an artificial agent also benefits from having a good representation of past and present states, and a good predictive model of the future, preferably a powerful predictive model implemented on a general purpose computer such as a recurrent neural network (RNN).

Large RNNs are highly expressive models that can learn rich spatial and temporal representations of data. However, many model-free RL methods in the literature often only use small neural networks with few parameters. The RL algorithm is often bottlenecked by the credit assignment problem, which makes it hard for traditional RL algorithms to learn millions of weights of a large model, hence in practice, smaller networks are used as they iterate faster to a good policy during training.

Ideally, we would like to be able to efficiently train large RNN-based agents. The backpropagation algorithm can be used to train large neural networks efficiently. In this work we look at training a large neural network to tackle RL tasks, by dividing the agent into a large world model and a small controller model. We first train a large neural network to learn a model of the agent’s world in an unsupervised manner, and then train the smaller controller model to learn to perform a task using this world model. A small controller lets the training algorithm focus on the credit assignment problem on a small search space, while not sacrificing capacity and expressiveness via the larger world model. By training the agent through the lens of its world model, we show that it can learn a highly compact policy to perform its task.

The goal of this article is to distill several key concepts from a series of papers from 1990 to 2015 on combinations of RNN-based world models and controllers.

In this article, we present a simplified framework that we can use to experimentally demonstrate some of the key concepts from these papers, and also suggest further insights to effectively apply these ideas to various RL environments. We use similar terminology and notation as "On Learning to Think: Algorithmic Information Theory for Novel Combinations of RL Controllers and RNN World Models" (Schmidhuber, 2015a) when describing our methodology and experiments.

We present a simple model inspired by our own cognitive system. In this model, our agent has a visual sensory component that compresses what it sees into a small representative code. It also has a memory component that makes predictions about future codes based on historical information. Finally, our agent has a decision-making component that decides what actions to take based only on the representations created by its vision and memory components.

The environment provides our agent with a high dimensional input observation at each time step. This input is usually a 2D image frame that is part of a video sequence. The role of the V model is to learn an abstract, compressed representation of each observed input frame.

Here, we use a simple Variational Autoencoder (VAE) as our V model to compress each image frame into a small latent vector $z$.

While it is the role of the V model to compress what the agent sees at each time frame, we also want to compress what happens over time. For this purpose, the role of the M model is to predict the future. The M model serves as a predictive model of the future $z$ vectors that V is expected to produce. Since many complex environments are stochastic in nature, we train our RNN to output a probability density function $p(z)$ instead of a deterministic prediction of $z$.

In our approach, we approximate $p(z)$ as a mixture of Gaussian distributions, and train the RNN to output the probability distribution of the next latent vector $z_{t+1}$ given the current and past information made available to it.

More specifically, the RNN will model $P(z_{t+1} | a_t, z_t, h_t)$, where $a_t$ is the action taken at time $t$ and $h_t$ is the hidden state of the RNN at time $t$. During sampling, we can adjust a temperature parameter $\tau$ to control model uncertainty, and we will find adjusting $\tau$ to be useful for training our controller later on.
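To make the role of the temperature concrete, here is a minimal NumPy sketch of sampling one scalar from a Gaussian mixture at temperature $\tau$. The function name `sample_mdn` and the specific convention (dividing the mixture logits by $\tau$ and scaling the standard deviation by $\sqrt{\tau}$) are assumptions for illustration; the text above only says that $\tau$ controls model uncertainty.

```python
import numpy as np

def sample_mdn(log_pi, mu, sigma, tau=1.0, rng=None):
    """Draw one scalar sample from a 1-D Gaussian mixture at temperature tau.

    log_pi, mu, sigma: arrays of shape (K,) with the unnormalized log mixture
    weights, the means, and the standard deviations of the K components.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Assumed convention: dividing the logits by tau flattens (tau > 1)
    # or sharpens (tau < 1) the mixture weights.
    logits = np.asarray(log_pi, dtype=float) / tau
    logits = logits - logits.max()  # numerical stability before exponentiating
    pi = np.exp(logits)
    pi = pi / pi.sum()
    k = rng.choice(len(pi), p=pi)   # pick a mixture component
    # Assumed convention: scaling the std by sqrt(tau) widens or narrows the component.
    return mu[k] + sigma[k] * np.sqrt(tau) * rng.standard_normal()

# A very low temperature makes sampling nearly deterministic:
# the result stays close to the mean of the most likely component.
x = sample_mdn(log_pi=[0.0, 5.0], mu=[-1.0, 3.0], sigma=[0.5, 0.5], tau=1e-6)
```

With a higher $\tau$ the same call spreads its samples over both components and over a wider range around each mean, which is one way a generated environment can be made more uncertain.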

This approach is known as a Mixture Density Network combined with a RNN (MDN-RNN), and has been applied in the past for sequence generation problems such as generating handwriting and sketches.

The Controller (C) model is responsible for determining the course of actions to take in order to maximize the expected cumulative reward of the agent during a rollout of the environment. In our experiments, we deliberately make C as simple and small as possible, and train it separately from V and M, so that most of our agent’s complexity resides in the world model (V and M).

rollout: a simulated run in which the agent selects actions in the environment according to its current policy and observes, in turn, the resulting state, reward, and next state.

C is a simple single layer linear model that maps $z_t$ and $h_t$ directly to action $a_t$ at each time step:

$a_t = W_c [z_t\;h_t] + b_c$

In this linear model, $W_c$ and $b_c$ are the weight matrix and bias vector that map the concatenated input vector $[z_t\;h_t]$ to the output action vector $a_t$.
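The linear map can be sketched in a few lines of NumPy. The class name `LinearController`, its constructor arguments, and the flat-parameter layout (weights first, then bias) are hypothetical choices for illustration, not the authors' implementation; a flat vector is used only because it is a convenient form for a black-box optimizer.

```python
import numpy as np

class LinearController:
    """a_t = W_c [z_t h_t] + b_c, with W_c and b_c unpacked from one flat vector."""

    def __init__(self, params, z_dim, h_dim, a_dim):
        params = np.asarray(params, dtype=float)
        n_w = a_dim * (z_dim + h_dim)
        if params.size != n_w + a_dim:
            raise ValueError("expected %d parameters" % (n_w + a_dim))
        self.W = params[:n_w].reshape(a_dim, z_dim + h_dim)  # weight matrix W_c
        self.b = params[n_w:]                                # bias vector b_c

    def action(self, z, h):
        # Concatenate the latent vector and the RNN hidden state, then apply the map.
        return self.W @ np.concatenate([z, h]) + self.b

# Tiny illustrative sizes (not the sizes used in the paper).
z_dim, h_dim, a_dim = 2, 3, 2
params = np.zeros(a_dim * (z_dim + h_dim) + a_dim)
params[-a_dim:] = [0.5, -0.5]  # with zero weights, the action equals the bias
controller = LinearController(params, z_dim, h_dim, a_dim)
a = controller.action(np.ones(z_dim), np.ones(h_dim))
```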

Flow diagram of our Agent model.

The raw observation is first processed by V at each time step $t$ to produce $z_t$. The input into C is this latent vector $z_t$ concatenated with M’s hidden state $h_t$ at each time step. C will then output an action vector $a_t$ for motor control, which will affect the environment. M will then take the current $z_t$ and action $a_t$ as an input to update its own hidden state to produce $h_{t+1}$ to be used at time $t+1$.

Below is the pseudocode for how our agent model is used in the OpenAI Gym (Brockman et al., 2016) environment:

def rollout(controller):
    """env, rnn, vae are global variables."""
    obs = env.reset()
    h = rnn.initial_state()
    done = False
    cumulative_reward = 0
    while not done:
        # V compresses the current frame into the latent vector z.
        z = vae.encode(obs)
        # C maps [z, h] to an action.
        a = controller.action([z, h])
        obs, reward, done = env.step(a)
        cumulative_reward += reward
        # M updates its hidden state from the action, z, and the previous h.
        h = rnn.forward([a, z, h])
    return cumulative_reward

Running this function on a given controller C will return the cumulative reward during a rollout.

This minimal design for C also offers important practical benefits. Advances in deep learning provided us with the tools to train large, sophisticated models efficiently, provided we can define a well-behaved, differentiable loss function. Our V and M models are designed to be trained efficiently with the backpropagation algorithm using modern GPU accelerators, so we would like most of the model’s complexity and model parameters to reside in V and M. The number of parameters of C, a linear model, is minimal in comparison. This choice allows us to explore more unconventional ways to train C, for example even using evolution strategies (ES) to tackle more challenging RL tasks where the credit assignment problem is difficult.

To optimize the parameters of C, we chose the Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) as our optimization algorithm since it is known to work well for solution spaces of up to a few thousand parameters. We evolve parameters of C on a single machine with multiple CPU cores running multiple rollouts of the environment in parallel.
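To give a feel for how a controller can be optimized without gradients, here is a deliberately simplified evolution strategy in NumPy. It is not CMA-ES: it keeps a fixed isotropic search distribution whose step size merely decays, whereas CMA-ES adapts a full covariance matrix and its step size. The function name `simple_es` and all hyperparameter values are assumptions for illustration; in the setting above, `fitness` would be the cumulative reward returned by a rollout.

```python
import numpy as np

def simple_es(fitness, dim, pop_size=16, elite=4, sigma=0.5,
              decay=0.95, iters=50, seed=0):
    """Maximize fitness(x) with a basic elite-averaging evolution strategy."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    for _ in range(iters):
        # Sample candidate parameter vectors around the current mean.
        pop = mean + sigma * rng.standard_normal((pop_size, dim))
        # Evaluate each candidate (these evaluations could run in parallel).
        scores = np.array([fitness(x) for x in pop])
        # Move the mean to the average of the best-scoring candidates.
        best = pop[np.argsort(scores)[-elite:]]
        mean = best.mean(axis=0)
        # Shrink the search radius over time (a crude stand-in for adaptation).
        sigma *= decay
    return mean

# Toy objective standing in for "cumulative reward of a rollout".
target = np.array([1.0, -2.0, 0.5])

def toy_fitness(x):
    return -np.sum((x - target) ** 2)

solution = simple_es(toy_fitness, dim=3)
```

Because each candidate is scored only by a single scalar, the optimizer never needs the reward to be differentiable, which is the property the paragraph above relies on.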

In this section, we describe how we can train the Agent model described earlier to solve a car racing task. To our knowledge, our agent is the first known solution to achieve the score required to solve this task.

-> Why this task was chosen

We find this task interesting because although it is not difficult to train an agent to wobble around randomly generated tracks and obtain a mediocre score, CarRacing-v0 defines solving as getting an average reward of 900 over 100 consecutive trials, which means the agent can only afford very few driving mistakes.
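The solving criterion quoted above can be written as a small check. This is a sketch: the function name `is_solved` and the sliding-window reading of "100 consecutive trials" are assumptions, and the reward lists are made-up illustrative values, not results.

```python
from statistics import mean

def is_solved(episode_rewards, threshold=900.0, window=100):
    """True if some run of `window` consecutive episodes averages >= threshold."""
    n = len(episode_rewards)
    return any(
        mean(episode_rewards[i:i + window]) >= threshold
        for i in range(n - window + 1)  # empty range when there are too few episodes
    )

# Illustrative reward histories (hypothetical numbers).
steady = [905.0] * 100
wobbly = [930.0] * 90 + [300.0] * 10  # a few bad episodes drag the average down

print(is_solved(steady), is_solved(wobbly))
```

The second list shows why the criterion leaves little room for mistakes: ten poor episodes out of a hundred are enough to pull an otherwise strong average below 900.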

A predictive world model can help us extract useful representations of space and time. By using these features as inputs of a controller, we can train a compact and minimal controller to perform a continuous control task, such as learning to drive from pixel inputs for a top-down car racing environment called CarRacing-v0.

In this environment, the tracks are randomly generated for each trial, and our agent is rewarded for visiting as many tiles as possible in the least amount of time. The agent controls three continuous actions: steering left/right, acceleration, and brake.

To train our V model, we first collect a dataset of 10,000 random rollouts of the environment. We first have an agent act randomly to explore the environment multiple times, and record the random actions $a_t$ taken and the resulting observations from the environment. We use this dataset to train V to learn a latent space of each frame observed. We train our VAE to encode each frame into a low dimensional latent vector $z$ by minimizing the difference between a given frame and the reconstructed version of the frame produced by the decoder from $z$.
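As a rough illustration of this objective, here is a minimal NumPy sketch of a per-example VAE loss. The paragraph above mentions only the reconstruction term; the KL term included here is the standard VAE regularizer toward a unit Gaussian prior and is an addition for completeness, as are the function names and the choice of squared error for the reconstruction term.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, with eps ~ N(0, I) and sigma = exp(logvar / 2)."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def vae_loss(x, x_recon, mu, logvar):
    """Squared reconstruction error plus KL(N(mu, sigma^2) || N(0, I))."""
    recon = np.sum((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + kl

rng = np.random.default_rng(0)
x = rng.random(8)            # stand-in for a flattened image frame
mu = np.array([1.0, 0.0])    # encoder mean for a 2-D latent (toy size)
logvar = np.zeros(2)         # encoder log-variance
z = reparameterize(mu, logvar, rng)
loss = vae_loss(x, x, mu, logvar)  # perfect reconstruction: only the KL term remains
```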

We can now use our trained V model to pre-process each frame at time $t$ into $z_t$ to train our M model. Using this pre-processed data, along with the recorded random actions $a_t$ taken, our MDN-RNN can now be trained to model $P(z_{t+1} | a_t, z_t, h_t)$ as a mixture of Gaussians.
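Training a model to output a mixture of Gaussians typically means minimizing the negative log-likelihood of the observed next latent value under that mixture. The sketch below shows this loss for a single scalar dimension. The function name `mdn_nll` and the parameterization (unnormalized log weights and log standard deviations) are assumptions for illustration, not the authors' code.

```python
import numpy as np

def mdn_nll(z_next, log_pi, mu, log_sigma):
    """Negative log-likelihood of scalar z_next under a K-component Gaussian mixture.

    log_pi: unnormalized log mixture weights, shape (K,)
    mu, log_sigma: component means and log standard deviations, shape (K,)
    """
    # Normalize the mixture weights in log space.
    log_pi = log_pi - np.logaddexp.reduce(log_pi)
    # Log density of z_next under each Gaussian component.
    log_norm = (-0.5 * np.log(2.0 * np.pi) - log_sigma
                - 0.5 * ((z_next - mu) / np.exp(log_sigma)) ** 2)
    # Log-sum-exp over components gives the mixture log-likelihood.
    return -np.logaddexp.reduce(log_pi + log_norm)

# One standard normal component evaluated at its mean.
nll = mdn_nll(0.0, np.array([0.0]), np.array([0.0]), np.array([0.0]))
```

In an MDN-RNN the three parameter arrays would be produced by the RNN at each time step from $a_t$, $z_t$ and $h_t$, and the loss would be summed over latent dimensions and time steps.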

Note on the paragraph above:

In principle, we can train both models together in an end-to-end manner, although we found that training each separately is more practical, and also achieves satisfactory results. Training each model only required less than an hour of computation time on a single GPU. We can also train individual VAE and MDN-RNN models without having to exhaustively tune hyperparameters.

In this experiment, the world model (V and M) has no knowledge about the actual reward signals from the environment. Its task is simply to compress and predict the sequence of image frames observed. Only the Controller (C) model has access to the reward information from the environment. Since there are a mere 867 parameters inside the linear controller model, evolutionary algorithms such as CMA-ES are well suited for this optimization task.
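The figure of 867 parameters can be checked with a little arithmetic. The latent size of 32 and the three actions are stated in this review; the hidden state size of 256 is an assumption that this review does not state, chosen here because it reproduces the quoted count.

```python
def linear_param_count(n_inputs, n_outputs):
    """Weights plus biases of a single linear layer."""
    return n_inputs * n_outputs + n_outputs

Z_DIM = 32    # latent vector size, z in R^32
A_DIM = 3     # steering, acceleration, brake
H_DIM = 256   # ASSUMPTION: RNN hidden state size, not stated in this review

full = linear_param_count(Z_DIM + H_DIM, A_DIM)  # controller on [z_t h_t]
print(full)  # 867
```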

We can use the VAE to reconstruct each frame using $z_t$ at each time step to visualize the quality of the information the agent actually sees during a rollout. The figure below is a VAE model trained on screenshots from CarRacing-v0.

Figure 10. Despite losing details during this lossy compression process, the latent vector $z$ captures the essence of each image frame.

In the online version of this article, one can load randomly chosen screenshots to be encoded into a small latent vector $z$, which is used to reconstruct the original screenshot. One can also experiment with adjusting the values of the $z$ vector using the slider bars to see how it affects the reconstruction, or randomize $z$ to observe the space of possible screenshots.

To summarize the Car Racing experiment, below are the steps taken:

Collect 10,000 rollouts from a random policy.
Train VAE (V) to encode frames into $z \in \mathbb{R}^{32}$.
Train MDN-RNN (M) to model $P(z_{t+1} | a_t, z_t, h_t)$.
Define Controller (C) as $a_t = W_c [z_t\;h_t] + b_c$.
Use CMA-ES to solve for a $W_c$ and $b_c$ that maximize the expected cumulative reward.

MODEL	PARAMETER COUNT
VAE	4,348,547
MDN-RNN	422,368
CONTROLLER	867

V Model Only

Training an agent to drive is not a difficult task if we have a good representation of the observation. Previous works (Hünermann, 2017; Bling, 2015; Lau, 2016) have shown that with a good set of hand-engineered information about the observation, such as LIDAR information, angles, positions and velocities, one can easily train a small feed-forward network to take this hand-engineered input and output a satisfactory navigation policy. For this reason, we first want to test our agent by handicapping C to only have access to V but not M, so we define our controller as $a_t = W_c z_t + b_c$.

Figure 11. Limiting our controller to see only $z_t$, but not $h_t$, results in wobbly and unstable driving behaviours.

Although the agent is still able to navigate the race track in this setting, we notice it wobbles around and misses the tracks on sharper corners. This handicapped agent achieved an average score of 632 ± 251 over 100 random trials, in line with the performance of other agents on OpenAI Gym’s leaderboard (Klimov, 2016) and traditional Deep RL methods such as A3C (Khan & Elibol, 2016; Jang et al., 2017). Adding a hidden layer to C’s policy network helps to improve the results to 788 ± 141, but not quite enough to solve this environment.

Full World Model (V and M)

The representation $z_t$ provided by our V model only captures a representation at a moment in time and does not have much predictive power. In contrast, M is trained to do one thing, and to do it really well, which is to predict $z_{t+1}$. Since M’s prediction of $z_{t+1}$ is produced from the RNN’s hidden state $h_t$ at time $t$, this vector is a good candidate for the set of learned features we can give to our agent. Combining $z_t$ with $h_t$ gives our controller C a good representation of both the current observation and what to expect in the future.

Figure 12. Driving is more stable if we give our controller access to both $z_t$ and $h_t$.

We see that allowing the agent to access both $z_t$ and $h_t$ greatly improves its driving capability. The driving is more stable, and the agent is able to seemingly attack the sharp corners effectively. Furthermore, we see that in making these fast reflexive driving decisions during a car race, the agent does not need to plan ahead and roll out hypothetical scenarios of the future. Since $h_t$ contains information about the probability distribution of the future, the agent can just query the RNN instinctively to guide its action decisions. Like a seasoned Formula One driver or the baseball player discussed earlier, the agent can instinctively predict when and where to navigate in the heat of the moment.

METHOD	AVG. SCORE
DQN (PRIEUR, 2017)	343 ± 18
A3C (CONTINUOUS) (JANG ET AL., 2017)	591 ± 45
A3C (DISCRETE) (KHAN & ELIBOL, 2016)	652 ± 10
CEOBILLIONAIRE (GYM LEADERBOARD)	838 ± 11
V MODEL	632 ± 251
V MODEL WITH HIDDEN LAYER	788 ± 141
FULL WORLD MODEL	906 ± 21

Table 1. CarRacing-v0 scores achieved using various methods.

(Roughly speaking, a performance comparison table meant to show off how well the model does.)

Our agent is able to achieve a score of 906 ± 21 over 100 random trials, effectively solving the task and obtaining new state of the art results. Previous attempts (Khan & Elibol, 2016; Jang et al., 2017) using Deep RL methods obtained average scores in the 591–652 range, and the best reported solution on the leaderboard obtained an average score of 838 ± 11 over 100 random trials. Traditional Deep RL methods often require pre-processing of each frame, such as employing edge-detection (Jang et al., 2017), in addition to stacking a few recent frames (Khan & Elibol, 2016; Jang et al., 2017) into the input. In contrast, our world model takes in a stream of raw RGB pixel images and directly learns a spatial-temporal representation. To our knowledge, our method is the first reported solution to solve this task.

Since our world model is able to model the future, we are also able to have it come up with hypothetical car racing scenarios on its own. We can ask it to produce the probability distribution of $z_{t+1}$ given the current states, sample a $z_{t+1}$ and use this sample as the real observation. We can put our trained C back into this hallucinated environment generated by M. The following image from an interactive demo in the online version of this article shows how our world model can be used to hallucinate the car racing environment:
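The loop described here, in which each sampled $z_{t+1}$ is fed back as if it were a real observation, can be sketched as follows. `ToyWorldModel` is a hypothetical stand-in with random, untrained weights and a single Gaussian output instead of a mixture, so it only illustrates the data flow, not the behaviour of a trained MDN-RNN; all names and sizes are assumptions.

```python
import numpy as np

class ToyWorldModel:
    """Hypothetical stand-in for M: random weights, single-Gaussian output."""

    def __init__(self, z_dim=4, h_dim=8, a_dim=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.h_dim = h_dim
        self.Wh = 0.1 * self.rng.standard_normal((h_dim, z_dim + a_dim + h_dim))
        self.Wz = 0.1 * self.rng.standard_normal((z_dim, h_dim))

    def initial_state(self):
        return np.zeros(self.h_dim)

    def step(self, z, a, h, tau=1.0):
        # Update the hidden state from the current latent, action, and previous state.
        h_next = np.tanh(self.Wh @ np.concatenate([z, a, h]))
        # Predict the next latent and sample around it; tau scales the noise.
        mean = self.Wz @ h_next
        z_next = mean + 0.1 * np.sqrt(tau) * self.rng.standard_normal(mean.shape)
        return z_next, h_next

def dream_rollout(policy, model, z0, steps=10, tau=1.0):
    """Roll the policy forward inside the model; no real environment is used."""
    z, h = z0, model.initial_state()
    trajectory = [z]
    for _ in range(steps):
        a = policy(z, h)
        # The sampled z takes the place of an encoded real observation.
        z, h = model.step(z, a, h, tau)
        trajectory.append(z)
    return np.stack(trajectory)

# A placeholder policy that always outputs a zero action.
traj = dream_rollout(lambda z, h: np.zeros(3), ToyWorldModel(), np.zeros(4), steps=5)
```

Rendering such a trajectory would additionally require passing each $z$ through a decoder, which is omitted here.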

Figure 13. Our agent driving inside of its own dream world. Here, we deploy our trained policy into a fake environment generated by the MDN-RNN, and rendered using the VAE’s decoder. In the demo, one can override the agent’s actions as well as adjust $\tau$ to control the uncertainty of the environment generated by M.
