Hierarchical Reinforcement Learning: 4 Disadvantages of HRL Approaches

Ra Bot
3 min read · Jan 7, 2022

I won’t go into the banter of (re-)introducing Hierarchical Reinforcement Learning (HRL) to anyone who has clicked on this story. I’ll presume the requisite background knowledge of why HRL methods are gaining popularity in the machine-learning/AI space, and of their advantages.

Hierarchical reasoning mimics the two-level reasoning humans use (Thinking Fast and Slow by Daniel Kahneman), which makes HRL approaches very principled and attractive among ML practitioners.

While HRL methods (hierarchical actor-critic, the options framework, option-critic, etc.) provide a very principled, formalised approach to solving long-horizon decision-making tasks, here I want to highlight the major cons or disadvantages of such approaches.

The intuition of the Options Framework: decomposing a Semi-MDP (SMDP) into MDP sub-problems by defining options over those sub-problems. Think of the SMDP’s (large) dots as states or milestones in achieving an overall goal. For the ‘Going on a movie date’ example: if you are on the last subgoal (4: go to location), you are no longer worried about the availability of tickets, as that state is complete (its termination condition has been met).
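To make that intuition concrete, here is a minimal Python sketch of the canonical option tuple (initiation set, intra-option policy, termination condition) and of executing one option as a single SMDP step. The `env.step` call assumes a Gym-style environment API; everything else is illustrative rather than a definitive implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any    # whatever your environment uses as a state
Action = Any   # a primitive action

@dataclass
class Option:
    """The classic option tuple (I, pi, beta) from the Options Framework."""
    initiation_set: Callable[[State], bool]   # I: may this option start in state s?
    policy: Callable[[State], Action]         # pi: the intra-option (sub-)policy
    termination: Callable[[State], float]     # beta: probability of terminating in s

def run_option(env, option: Option, state: State, rng):
    """Execute one option until its termination condition fires (one SMDP 'step')."""
    assert option.initiation_set(state), "option not available in this state"
    total_reward, steps = 0.0, 0
    while True:
        action = option.policy(state)
        state, reward, done, _ = env.step(action)  # assumes a Gym-like step API
        total_reward += reward
        steps += 1
        if done or rng.random() < option.termination(state):
            return state, total_reward, steps
```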

As a rolling example, consider the scenario ‘Going on a movie date’, where you are planning to watch the new Matrix (Resurrections) movie with a date (or friend). What does the planning entail? There are a few high-level subgoals (or macro actions) you need to accomplish before the end result is achieved (i.e. you’re in the theatre watching it); to exemplify (a rough sketch of these subgoals as options follows the list):

  1. Find a suitable (theatre) location.
  2. Schedule a time: this entails finding a show-time when both of your calendars are free.
  3. Buy tickets: arrange some form of payment.
  4. Go to location: walk, ride a bike, take the subway etc.
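Reusing the `Option` sketch from above, the four subgoals might look roughly like the following. The state is a plain dict, the sub-policy bodies are stubs, and every name here is made up for illustration; in a real agent each sub-policy would itself be learned.

```python
# Illustrative only: the four subgoals above, expressed as options over a dict state.
def find_location(s):   return "search_theatres"   # placeholder macro action
def schedule_time(s):   return "match_calendars"
def buy_tickets(s):     return "pay"
def go_to_location(s):  return "take_subway"

movie_date_options = {
    "1_find_location":  Option(lambda s: True,            find_location,
                               lambda s: float("theatre" in s)),
    "2_schedule_time":  Option(lambda s: "theatre" in s,   schedule_time,
                               lambda s: float("showtime" in s)),
    "3_buy_tickets":    Option(lambda s: "showtime" in s,  buy_tickets,
                               lambda s: float("tickets" in s)),
    "4_go_to_location": Option(lambda s: "tickets" in s,   go_to_location,
                               lambda s: float(s.get("at_theatre", False))),
}
# The top-level (manager) policy now only chooses among these four options,
# instead of reasoning over every primitive action at every timestep.
```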

Again, I won’t digress into the whys of choosing HRL methods for solving such long-horizon goal-planning or action-execution tasks. Let’s discuss four major disadvantages, elucidated using the rolling example:

  1. Domain knowledge requirement: designing the subgoals requires problem/task-specific domain knowledge, and subsequent hand-engineering. In our example, we need to know the theatre locations, the hard constraints of scheduling, buying tickets, etc. Designing a hierarchical sub-policy cannot sidestep knowing these task-specific constraints.
  2. Algorithmic complexity: even with our domain knowledge [1] for the task, specific subgoals must be identified, and sub-policies for those subgoals must be learned. To elucidate, for ‘going on a movie date’ we had to identify the four subgoals (from our domain knowledge), then design and learn each one (with each subgoal’s (plausibly) unique reward and termination constraints).
  3. Computational (combinatorial) complexity: in the HRL realm, the sub-policies for subgoals are usually learnt using a combination of primitive actions (e.g., ‘motion/navigation primitives’ could be `turn [left|right]`, `move [forward|backward]`; ‘interaction primitives’ could be pick up(<obj>), toggle(<obj>), put(<obj>), etc.). As you can imagine, the combinatorial explosion of such primitives can very easily become infeasible (see the rough count after this list). To get a better intuition for this phenomenon, imagine a simple task of picking up a wine glass from a table and then putting it down. Then consider planning this task using granular primitives: you can use your right hand or your left; you can use any two fingers of either hand; you can slouch, or stand up straight, or reach for the glass in a wavy hand motion. You get the picture.
  4. Efficiency or optimality: lastly, HRL approaches cannot guarantee optimality of the overall aggregated policy. Using our rolling example, even if the trained HRL policy ensures task completion, it may not be the most efficient way to do it; each of the subgoals could be accomplished in a more efficient way. For intuition, let’s use a simple case: consider subgoal 4, Go to location. Imagine you have a theatre right by your place, and you can arrive there with a brisk 10-minute walk. However, it is also possible to (programmatically) call an Uber ride, which takes 8 minutes to arrive, then 2–5 minutes (depending on traffic lights) to get there, and also costs you about $10 (maybe $12 if you are generous with tipping). While both approaches result in the same end goal, in reality, I can bet $9 that you’d choose the former option every time (a toy comparison follows this list). For a more technical read, look up Kolmogorov Complexity; there is some fascinating research on the efficiency of decision making and such.
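A back-of-the-envelope count for point 3: the number of length-H sequences over a set of primitive actions grows as |A|^H. The specific numbers below are made up purely for illustration.

```python
# Rough count for point 3: sequences of primitives grow exponentially with horizon.
n_primitives = 12  # e.g. turn left/right, move forward/backward, pick/put/toggle, per hand...
for horizon in (5, 10, 20):
    print(f"horizon {horizon:2d}: {n_primitives ** horizon:.3e} possible primitive sequences")
# horizon  5: 2.488e+05
# horizon 10: 6.192e+10
# horizon 20: 3.834e+21  -> brute-force planning over raw primitives blows up very fast
```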
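And a toy illustration of point 4: two sub-policies that both terminate the ‘go to location’ subgoal but differ in cost. The dollars-per-minute trade-off is an arbitrary assumption made only for this sketch.

```python
# Toy comparison for point 4: both sub-policies complete the subgoal, only one is efficient.
def total_cost(minutes, dollars, dollars_per_minute=0.5):
    """Collapse time and money into a single (assumed) scalar cost."""
    return dollars + dollars_per_minute * minutes

walk = total_cost(minutes=10, dollars=0)        # brisk 10-minute walk, free
uber = total_cost(minutes=8 + 5, dollars=12)    # wait + ride, fare with tip
print(f"walk ~ ${walk:.2f}, uber ~ ${uber:.2f}")
# walk ~ $5.00, uber ~ $18.50
# A learned option guarantees termination of the subgoal, not that it picks the cheaper path.
```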

In summary, I wanted to elucidate some of the cons associated with HRL approaches in plain, simple language, without the usual (significant) technical jargon of these approaches. Hope you enjoyed the read; your likes will certainly give me the dopamine boost to write more on this topic.
