Provably Efficient and Trustworthy Reinforcement Learning

Open Access
- Author:
- Huang, Ruiquan
- Graduate Program:
- Electrical Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- February 17, 2025
- Committee Members:
- Madhavan Swaminathan, Program Head/Chair
Minghui Zhu, Major Field Member
Mehrdad Mahdavi, Major Field Member
Jia Li, Outside Unit & Field Member
Jing Yang, Chair & Dissertation Advisor
- Keywords:
- Reinforcement Learning
Bandits
Federated Learning
Differential Privacy
Safety
Multi-task Learning
POMDP
Reward-free RL
Low-rank MDPs
Predictive State Representations
PSRs
Regret
Sample Complexity
Optimism
Upper Confidence Bound
UCB
Offline RL
- Abstract:
- This dissertation studies reinforcement learning (RL), where an agent sequentially interacts with the environment and receives rewards. The agent's goal is to maximize its performance, characterized by the cumulative reward or the value function. While RL has achieved tremendous success in games and other simulated scenarios, its adoption in real systems remains limited due to various considerations and constraints in practical applications. This work addresses these challenges by designing RL algorithms that achieve both efficiency, by minimizing regret and sample complexity, and trustworthiness, by ensuring compliance with stringent privacy and safety constraints.

First, we consider contextual bandits in a federated setting. Contextual bandits are a special case of RL in which each episode consists of a single step. In the federated setting, each individual agent or client faces a different bandit problem, and the problems are coupled through common global parameters. In Chapter 2, by exploiting the geometric structure of the rewards, we propose a collaborative algorithm called Fed-PE that copes with heterogeneity across clients without exchanging local feature vectors or raw data. Fed-PE relies on a novel multi-client G-optimal design and achieves near-optimal regret, in the sense that its regret upper bound matches the lower bound up to logarithmic factors. To build a private and trustworthy system, in Chapter 3 we study federated linear contextual bandits under the notion of user-level differential privacy (DP). We first introduce a unified federated bandits framework that can accommodate various definitions of DP in the sequential decision-making setting. We then propose a federated algorithm termed ROBIN and show that it is near-optimal with respect to the number of clients $M$ and the privacy budget $\varepsilon$ by deriving nearly matching upper and lower regret bounds under user-level DP.

Next, we study reward-free RL, where an agent first explores the environment without any reward information in order to accomplish certain learning goals afterwards for any given reward. In Chapter 4, we focus on reward-free RL under the low-rank Markov decision process (MDP) model, in which the transition kernel decomposes into the inner product of two feature representations. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity remains far from satisfactory. We first provide a sample complexity lower bound that holds for any algorithm under low-rank MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show that it can both find an $\epsilon$-optimal policy and achieve $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity that significantly improves upon previous results. This sample complexity nearly matches our lower bound in its dependence on $\epsilon$, as well as on $K$ in the large-$d$ regime, where $d$ and $K$ denote the representation dimension and the action-space cardinality, respectively. In Chapter 5, we focus on safe exploration in reward-free RL, where the agent must abide by certain safety constraints during the exploration phase. To achieve this goal, we propose a unified algorithm called SWEET, which leverages the concavity and continuity of newly introduced truncated value functions and is guaranteed to incur zero constraint violation during exploration with high probability. We then particularize the SWEET framework to the tabular and low-rank MDP settings. Surprisingly, we prove that, up to constant factors, the sample complexity of SWEET matches that of its constraint-free counterpart.
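As an illustrative sketch of the low-rank structure referenced in Chapters 4 and 5 (the notation here is schematic; the precise assumptions are stated in those chapters), a low-rank MDP posits a transition kernel that factors as

$P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle$,

where $\phi$ and $\mu$ are (possibly unknown) $d$-dimensional feature maps over state-action pairs and next states, respectively, and $d$ is the representation dimension.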
Finally, we investigate RL modeled by non-Markovian decision processes, where the next observation is determined by the entire history of observations and actions. In Chapter 6, we consider predictive state representations (PSRs), a general model that includes MDPs and partially observable MDPs (POMDPs) as special cases. We propose the first known upper confidence bound (UCB) based approach for PSRs, featuring a novel bonus term that upper-bounds the total variation distance between the estimated and true models. We further characterize sample complexity bounds for our UCB-type algorithms in both online and offline PSR settings. In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, a last-iterate guarantee of a near-optimal policy, and guaranteed model accuracy. In Chapter 7, we study how to further improve the sample complexity of learning PSRs through multi-task learning. The main challenge is that the large and complex model space makes it difficult to identify what types of common latent structure across tasks can reduce model complexity and improve sample efficiency. To this end, we posit a joint model class over tasks and use the notion of the $\eta$-bracketing number to quantify its complexity; this number also serves as a general metric that captures task similarity and thus determines the benefit of multi-task over single-task RL. Specifically, we first propose a provably efficient algorithm for upstream multi-task learning over PSRs, in which all tasks share only the same observation and action spaces. The advantage of multi-task learning manifests when the joint model class of PSRs has a smaller $\eta$-bracketing number than that of individual single-task learning. We further investigate downstream learning, in which the agent must learn a new target task that shares commonalities with the upstream tasks via a similarity constraint. By exploiting the PSRs learned upstream, we develop a sample-efficient algorithm that provably finds a near-optimal policy.
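As an illustrative sketch of the optimism principle behind the UCB-type construction in Chapter 6 (the notation is schematic; $\theta^*$ and $\widehat{\theta}$ denote the true and estimated models, $V^{\pi}_{\theta}$ the value of policy $\pi$ under model $\theta$, and $D_{\mathrm{TV}}$ the total variation distance between the trajectory distributions that $\pi$ induces under the two models), suppose the total reward of any trajectory is bounded by $1$ and the bonus satisfies $b(\pi) \ge D_{\mathrm{TV}}\big(\mathbb{P}^{\pi}_{\widehat{\theta}}, \mathbb{P}^{\pi}_{\theta^*}\big)$ for every policy $\pi$. Then $V^{\pi}_{\widehat{\theta}} + b(\pi) \ge V^{\pi}_{\theta^*}$ for every $\pi$, and the optimistic policy

$\widehat{\pi} = \arg\max_{\pi} \big\{ V^{\pi}_{\widehat{\theta}} + b(\pi) \big\}$

satisfies $\max_{\pi} V^{\pi}_{\theta^*} - V^{\widehat{\pi}}_{\theta^*} \le 2\, b(\widehat{\pi})$, so its suboptimality is controlled by the bonus of the selected policy.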