---
title: "master's thesis"
weight: 20
---

# Reinforcement Learning<br/>Theory and Implementation in a Custom Environment

---

You can find the thesis [here](/pdf/mthesis.pdf) and the code [here](https://github.com/aethrvmn/GodotPneumaRL).

## Abstract

Reinforcement Learning (RL) is a subcategory of Machine Learning that consistently surpasses human performance and demonstrates superhuman understanding in various environments and datasets. Its applications span from mastering games like Go and Chess to optimizing real-world operations in robotics, finance, and healthcare. The adaptability and efficiency of RL algorithms in dynamic and complex scenarios highlight their transformative potential across multiple domains.

In this thesis, we present some core concepts of Reinforcement Learning.

First, we introduce the mathematical foundation of Reinforcement Learning through Markov Decision Processes (MDPs), which provide a formal framework for modeling decision-making problems where outcomes are partly random and partly under the control of a decision-maker, with state transitions influenced by the agent’s actions. Then, we give an overview of the two main branches of Reinforcement Learning: value-based methods, which focus on estimating the value of states or state-action pairs, and policy-based methods, which directly optimize the policy that dictates the agent’s actions.
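
For reference, a minimal sketch of the standard textbook formulation (general notation, not specific to this thesis): an MDP is a tuple of states, actions, transition dynamics, rewards, and a discount factor, and the two branches differ in which object they estimate.

```latex
% Standard MDP definitions in textbook notation (not thesis-specific).
% An MDP is a tuple of states, actions, transition kernel, reward, discount:
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad \gamma \in [0, 1)

% Value-based methods estimate, e.g., the optimal action-value function,
% which satisfies the Bellman optimality equation:
Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
  \left[ R(s, a) + \gamma \max_{a'} Q^*(s', a') \right]

% Policy-based methods instead follow the policy gradient directly:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}
  \left[ \nabla_\theta \log \pi_\theta(a \mid s) \, A^{\pi_\theta}(s, a) \right]
```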

We focus on Proximal Policy Optimization (PPO), the de facto baseline algorithm in modern RL literature thanks to its robustness and ease of implementation, and discuss its potential advantages, such as improved sample efficiency and stability, as well as its disadvantages, including sensitivity to hyperparameters and computational overhead. We emphasize the importance of fine-tuning PPO to achieve optimal performance.
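
As a concrete illustration, the clipped surrogate objective at the core of PPO fits in a few lines of PyTorch. This is a minimal sketch in generic notation, assuming log-probabilities and advantage estimates are computed elsewhere; the function name and default clip coefficient are illustrative, not taken from the thesis code.

```python
import torch

def ppo_clip_loss(new_logp: torch.Tensor,
                  old_logp: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(new_logp - old_logp)
    # Unclipped and clipped surrogate objectives
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the pessimistic (elementwise) minimum of the two;
    # negate so the result can be minimized with gradient descent.
    return -torch.min(unclipped, clipped).mean()
```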

We demonstrate the application of these concepts within Pneuma, a custom-made environment designed specifically for this thesis. Pneuma aims to become a research base for independent Multi-Agent Reinforcement Learning (MARL), where multiple agents learn and interact within the same environment. We outline the requirements such environments must meet to support MARL effectively, and detail the modifications we made to the baseline PPO method, as presented by OpenAI, to facilitate agent convergence at the single-agent level.
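
To make this kind of requirement concrete, below is a generic sketch of the per-agent interface a MARL environment typically exposes (hypothetical names, not the actual Pneuma API): every agent submits an action each tick and receives its own observation, reward, and done flag from the shared world.

```python
# Generic per-agent MARL environment interface (illustrative only;
# not the actual Pneuma API). Each agent observes, acts, and is rewarded
# independently while sharing the same underlying world state.
from typing import Any, Dict, Tuple

class MultiAgentEnv:
    def reset(self) -> Dict[str, Any]:
        """Return the initial observation for every agent, keyed by agent id."""
        raise NotImplementedError

    def step(self, actions: Dict[str, int]) -> Tuple[
        Dict[str, Any],    # next observation per agent
        Dict[str, float],  # reward per agent
        Dict[str, bool],   # done flag per agent
    ]:
        """Advance the shared world by one tick given every agent's action."""
        raise NotImplementedError
```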

Finally, we discuss the potential for future enhancements to the Pneuma environment to increase its complexity and realism, aiming to create a more RPG-like setting, optimal for training agents in complex, multi-objective, and multi-step tasks.