
LLM Q* (Q-star) and Q-learning

What are Q* and Q-learning? How do they relate to DBZ Q*, and how do the approaches compare?

Figure: Q-learning sample cycle (State, Reward, Agent, Action, Environment). Source: OpenAI

The diagram shows the agent-environment cycle: the agent observes a state and takes an action, the environment returns a reward and a new state, and that new state becomes the input for the next pass through the loop.
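The cycle in the diagram can be sketched in a few lines of Python. This is only an illustration: the `Corridor` environment, the `run_episode` helper, and the "always move right" policy are hypothetical names invented here, not part of any OpenAI or DBZ code.

```python
class Corridor:
    """Hypothetical 1-D environment: cells 0..4, reward on reaching cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped at the walls)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=50):
    """One pass around the loop per step: state -> action -> reward, next state."""
    state = env.reset()
    trace = []
    for _ in range(max_steps):
        action = policy(state)                       # Agent picks an Action
        next_state, reward, done = env.step(action)  # Environment responds
        trace.append((state, action, reward, next_state))
        state = next_state                           # new State feeds the next step
        if done:
            break
    return trace


# Fixed policy for illustration only: always move right.
trace = run_episode(Corridor(), policy=lambda state: 1)
for state, action, reward, next_state in trace:
    print(state, action, reward, next_state)
```

Each printed row is one trip around the diagram: the state the agent saw, the action it chose, the reward it received, and the state it landed in.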

Q-learning is a popular reinforcement learning technique used in modern AI systems. It operates on a trial-and-error approach where an AI agent learns to optimize its actions in a particular environment to maximise long-term rewards.
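In the standard textbook formulation (the symbols below are the usual conventions, not notation taken from this article), the agent keeps an estimate Q(s, a) of the long-term reward for taking action a in state s, and nudges that estimate after every step:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

Here \alpha is the learning rate, \gamma is the discount factor that weights future rewards against immediate ones, and r_{t+1} is the reward received after taking action a_t in state s_t. The bracketed term is the gap between what the agent expected and what it now believes after seeing the outcome.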

Think of the AI agent as a decision-maker that navigates a complex landscape, where each action has a potential positive or negative outcome. The technique's logic drives the gaming world and the behaviour of autonomous agents, with humans in the loop augmenting decisions for rewards. A reward could be as small as a token or as large as a new level.

Q-learning provides a framework for the AI to evaluate its choices and refine its strategy over time, leading to more informed and impactful decisions with experience. This self-learning Operand ability has broad applications. Think of it as an operations procedures manual that a team reads, refines, and sends forward for update, and is paid for doing so. Generalised, it would be the "Department of Quality Assurance & Improvement", covering everything from streamlining business operations to creating personalized customer experiences.
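The evaluate-and-refine loop described above can be sketched as tabular Q-learning with epsilon-greedy exploration. This is a minimal sketch under stated assumptions: the five-cell corridor, the function name `train_q_table`, and every hyperparameter value are illustrative choices made here, not anything specified by OpenAI or DBZ.

```python
import random


def train_q_table(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                  epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical corridor: start at cell 0, goal at the last cell."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[state][action]; action 0 = left, action 1 = right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so an episode always ends
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore: try something at random
            else:
                best = max(q[state])       # exploit: best known action,
                action = rng.choice(       # breaking ties at random
                    [a for a in (0, 1) if q[state][a] == best])
            move = 1 if action == 1 else -1
            next_state = min(max(state + move, 0), goal)
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Q-learning target: reward now plus discounted best value later
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q = train_q_table()
# Greedy action for each non-goal cell after training
greedy = [max((0, 1), key=lambda a: q[s][a]) for s in range(4)]
print(greedy)
for s in range(4):
    print(s, [round(v, 3) for v in q[s]])
```

The table starts at zero, so early episodes are close to a random walk. Once the goal has been reached a few times, the reward propagates backwards through the table and the greedy choice settles on moving right.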

Operands are terms or expressions used in algebra, arithmetic, or other mathematical operations. An operand can be a single number, a variable, or a more complex expression. Operands are typically written in the order in which the operation is applied to them, following the rules of that operation. They appear in many mathematical contexts, such as evaluating a function or solving an equation.


Pros:

  • Makes Operands faster.

  • Provides a "before" and "after" state for each Operand once a cycle is completed.

  • Can be applied as an inline process or as a call.

Cons:

  • Q-learning focuses on maximizing rewards without necessarily considering broader ethical impacts.

  • Compute-hungry.

  • Added complexity.

What's a real-world or, better still, a historical use case of Q-learning?

Q-learning has been applied as a natural improvement within large language model methods. For example, OpenAI's sample open-source model from 2018 utilises Q-learning. A comparison shows the differences between a large language model example (GPT-1?) architecture and the DBZ model-less version. This sample architecture used a gaming output to evaluate coherence results (the stickman's play went from looking drunk to being in control).

Authoring the SHE ZenAI Q* algorithm refinement led to a question: how do LLMs use Q-learning? This table compares the techniques in that 2018 sample LLM schema with SHE ZenAI's to show the differences. Both enhance performance, depending on how the functions are applied.

SHE ZenAI addresses the cons by integrating ethical considerations and human well-being directly into its Q-learning decisions, a function path that goes beyond traditional Q-learning methods. Unlike the LLM approach, which often treats ethical considerations as afterthoughts or additional layers, SHE ZenAI considers ethics and human welfare part of its core decision-making process.
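The article does not specify how SHE ZenAI builds these considerations into its decisions, so the following is not its method. It is a generic, hypothetical sketch of one well-known idea: subtracting a weighted cost from the reward, so that the value the agent maximises already accounts for side effects. The action names, the numbers, and `harm_weight` are all invented for illustration.

```python
def shaped_reward(task_reward, harm_cost, harm_weight):
    """Reward with a penalty term; harm_weight = 0 recovers plain reward maximisation."""
    return task_reward - harm_weight * harm_cost


def one_step_values(actions, harm_weight):
    """In a one-step decision, each action's value is simply its (penalised) reward."""
    return {
        name: shaped_reward(a["task_reward"], a["harm_cost"], harm_weight)
        for name, a in actions.items()
    }


def greedy_action(values):
    """Pick the action with the highest value."""
    return max(values, key=values.get)


# Hypothetical choices: the shortcut pays more but carries a side-effect cost.
actions = {
    "shortcut": {"task_reward": 10.0, "harm_cost": 8.0},
    "safe_route": {"task_reward": 7.0, "harm_cost": 0.0},
}

plain = greedy_action(one_step_values(actions, harm_weight=0.0))
penalised = greedy_action(one_step_values(actions, harm_weight=1.0))
print(plain, penalised)
```

With the penalty switched off, the agent takes the higher-paying shortcut. With it switched on, the same greedy rule prefers the safe route, because the cost is part of the quantity being maximised rather than a check applied afterwards.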


| Technique | OpenAI 2018 Schema | SHE ZenAI | Comparison |
|---|---|---|---|
| | A core RL algorithm | Integrated with DBZ Q* algorithm for ethical decision-making | Both utilize Q-learning techniques, but DBZ Q* embeds at Ops level. |
| Model-Free RL | RL algorithms | Trinity* Core algorithm is model-free | Both incorporate a model-free RL approach. |
| Model-Based RL | RL algorithms | Used as LTMS | LLM is the Core. LLM ancillary memory |
| Policy Optimization | RL algorithms | Defined algorithm parameters | Use Policies for different purposes |
| Learn the Model | Included as an approach under Model-Based RL | K* algorithm handles knowledge management and learning. | Both involve learning/updating, but K* focuses on knowledge representation rather than environment model learning. |
| Given the Model | Included as an approach under Model-Based RL | | SHE ZenAI does not rely on being given pre-defined models |
| Policy Gradient | Lists specific algorithms like A2C/A3C, PPO | | |
| | Included as a Q-learning algorithm | Q* builds upon Q-learning | Both utilize Q-learning foundations, but SHE ZenAI's Q* algorithm expands on it significantly. |
| | Test Rig visual training simulation | | |
| | Included as Q-learning algorithm variations | | |
| | Included as model-based RL algorithms | | |
| | Included as policy optimization algorithms | | Original SHE specific Function Operand calls |

This holistic approach ensures that SHE ZenAI possesses the knowledge, optimisation capabilities, and ethical grounding to make intelligent and human-centric choices.

Like to know more about Q-learning? We have three levels of guide, "Dummies: 101", "Try Me, with More Tech" and "Bite Me, it's getting Technical", assembled while designing Omega* and, in particular, the operands of DBZ's version of Q*. We will put the technical versions in a section in the forum.

References:

1. Design By Zen [SHE is Zen AI]

2. Towards Characterizing Divergence in Deep Q-Learning [Joshua Achiam, Ethan Knight, Pieter Abbeel] 21-03-20


Author Bio:

David W. Harvey, CEO of Design By Zen, merges 43 years of IT and high-tech design expertise with groundbreaking innovation. Inventor of the DBZ Comfort Index, the Holistic Objectives algorithm, and the pioneering Social Harmony Ecosystem or Engine (SHE) ZenAI architecture, David's work also includes the world's first intelligent earthquake table, EQ1. Holder of multiple international patents, his professional excellence parallels a fervent interest in exotic cars and simulation engineering. Off-screen, David finds balance in cultivating a Zen garden, reflecting his philosophy of harmony in technology and life through art.
