This post is based on the journal paper “Reinforcement Learning in Reliability and Maintenance Optimization: A Tutorial”.
Risk of system degradation
Engineered systems inevitably experience degradation, leading to unexpected shutdowns and significant adverse consequences.
Why use reinforcement learning (RL) to enhance system reliability?
A Markov decision process (MDP) provides an analytical framework for reliability and maintenance optimization, in which the system state degrades over operation and the optimal action for each state is identified. However, solving MDPs typically faces scalability and dimensionality issues: the computational burden of traditional algorithms (e.g., dynamic programming) grows rapidly with problem size, which limits their applicability in engineering practice. Reinforcement learning (RL), a major paradigm in machine learning, has emerged as a powerful and effective tool for solving MDPs. RL focuses on sequential decision-making problems and determines the optimal actions that maximize cumulative reward. It follows a trial-and-error learning paradigm, gaining knowledge through interactions with the environment and shaping behavior through rewards and punishments. Because it learns optimal actions from interaction experience, RL can operate in a model-free manner without pre-labeled data, which confers a high degree of adaptability and flexibility across complicated engineering scenarios. Together with advances in computational capabilities and algorithmic strategies, the RL paradigm helps overcome the scalability and dimensionality limitations of classical MDP solution methods.
Additionally, advances in deep learning have further enhanced RL, culminating in the transformative paradigm known as deep reinforcement learning (DRL). DRL leverages the function approximation and generalization capabilities of deep neural networks to handle high-dimensional and continuous state spaces effectively. Integration with deep neural networks also enables the processing of diverse types of input data, including images, text, and audio, extending DRL’s applicability to a broader spectrum of engineering scenarios. The strong capabilities of RL and DRL in addressing complicated, large-scale problems have made them among the most prevalent tools in the reliability and maintainability community.
Markov Decision Process
- Identify all possible states of the system. These states can be determined from historical performance data and expert knowledge.
- Identify the actions that can influence the system’s state (system reconfiguration / mission abort / loading distribution).
- Define the probabilities of transitioning from one state to another given a specific action. These probabilities can be estimated from historical mission data, failure data, or other relevant information.
- Develop a reward function that quantifies the benefits of all state–action pairs. The function typically combines elements such as operation cost, mission profit, and failure penalties; a toy specification illustrating these four ingredients is sketched below.
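
To make these ingredients concrete, here is a minimal sketch of how such an MDP specification could be written down in Python. Every state label, action name, probability, and cost figure below is an illustrative placeholder, not a value from the paper:

```python
# A minimal, illustrative MDP specification for a maintenance problem.
# All labels, probabilities, and costs are hypothetical placeholders.
mdp_spec = {
    # 1) Possible system states (from historical data / expert knowledge)
    "states": ["perfect", "minor_flaw", "major_flaw", "failed"],
    # 2) Actions that can influence the system state
    "actions": ["do_nothing", "repair", "replace"],
    # 3) Transition probabilities: P[action][state] -> {next_state: probability}
    "transitions": {
        "do_nothing": {
            "perfect":    {"perfect": 0.90, "minor_flaw": 0.10},
            "minor_flaw": {"minor_flaw": 0.85, "major_flaw": 0.15},
            "major_flaw": {"major_flaw": 0.80, "failed": 0.20},
            "failed":     {"failed": 1.00},
        },
        # entries for "repair" and "replace" would be filled in analogously
    },
    # 4) Reward combining operation cost, mission profit, and failure penalty
    "reward": lambda state, action: (-500.0 if state == "failed"
                                     else 100.0 - (50.0 if action == "repair" else 0.0)),
}
```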
Formulation of mission abort problems
Suppose a system is meant to perform a specific mission. The system state degrades over operation, and the states are denoted as $\{0, 1, \dots, n\}$, distinguished by their performance capacities. State $0$ and state $n$ are the completely failed state and the perfectly functioning state, respectively. The performance capacity of the system in state $i$ is denoted as $g_i$ $(i \in \{0, 1, \dots, n\})$, and one has $g_i < g_j$, $\forall i < j$. The system undergoes periodic inspection with equal time interval $\Delta t$, and the system state at the beginning of each time step can be precisely observed. The state and performance capacity of the system at time step $t$ are represented as $X_t$ and $G_t$, respectively. The degradation process of the system is characterized by a homogeneous discrete-time Markov chain with state transition matrix $P = [p_{i,j}]_{(n+1) \times (n+1)}$, where $p_{i,j}$ $(i, j \in \{0, 1, \dots, n\})$ denotes the transition probability from state $i$ to state $j$. $P$ is a lower triangular matrix because the state of the system cannot be restored (degradation is irreversible). The state transition matrix can be estimated from the system’s historical observation data.
At each time step $t$, the decision maker has the option to either abort the mission or continue it. The costs of mission failure and system failure (state $0$) are denoted as $c_m$ and $c_f$, respectively. If the selected action is aborting the mission, denoted as $a_0$, then the mission failure cost $c_m$ is incurred. If the selected action is continuing the mission, denoted as $a_1$, the system will continue to execute the mission until the next inspection, and an operation cost $c_{op}$ is incurred. If the system fails during the mission, both the mission failure cost and the system failure cost, $c_m + c_f$, are incurred. The success of the mission depends on the system’s cumulative performance and the demand. The cumulative performance of the system at time step $t$ is denoted as $\psi_t$, with $\psi_t = \sum_{\tau=1}^{t} G_\tau \Delta t$.
If the cumulative performance of the system reaches the pre-specified demand $D$ within the mission duration $T$, i.e., $\psi_t \geq D$ for some $t \in [1, T]$, the mission succeeds and a profit $r_m$ is yielded. Conversely, if the system remains operational throughout the mission duration $T$ but the cumulative performance falls short of the demand, i.e., $\psi_T < D$, the mission failure cost $c_m$ is incurred.
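
The following sketch simulates one mission under an “always continue” policy to illustrate these dynamics: the degraded state evolves according to a lower-triangular transition matrix, the cumulative performance accumulates as $\psi_t = \sum_{\tau=1}^{t} G_\tau \Delta t$, and the episode ends on system failure, mission success, or expiry of the mission duration. The matrix, capacities, demand, and horizon here are small illustrative values, not the case-study data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative three-state example (state 0 = failed, state 2 = perfectly functioning).
P = np.array([              # lower triangular: degradation is irreversible
    [1.0, 0.0, 0.0],
    [0.2, 0.8, 0.0],
    [0.1, 0.2, 0.7],
])
g = np.array([0.0, 1.0, 2.0])   # performance capacity g_i of each state
D, T, dt = 8.0, 6, 1.0          # demand, mission duration, inspection interval

x, psi = 2, 0.0                 # start in the best state with zero cumulative performance
for t in range(1, T + 1):
    psi += g[x] * dt            # psi_t = sum of G_tau * Delta t over the elapsed steps
    if psi >= D:                # cumulative performance reaches the demand
        print(f"step {t}: mission succeeded, psi = {psi:.1f}")
        break
    x = int(rng.choice(len(g), p=P[x]))   # sample the next degraded state
    if x == 0:                  # the system fails during the mission
        print(f"step {t}: system failed, psi = {psi:.1f}")
        break
else:
    print(f"mission duration elapsed with psi = {psi:.1f} < D = {D}")
```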
State space
The state space consists of the system’s degraded state, the system’s cumulative performance, and the remaining mission time, i.e., $s_t \in \{(X_t, \psi_t, \delta_t) \mid 0 < X_t \leq n,\ 0 \leq \psi_t < D,\ 0 \leq \delta_t < T\} \cup \{\phi\}$, where $\phi$ is the terminal state, which covers four cases:
- the system fails during the mission
- the mission succeeds, i.e., the system’s cumulative performance reaches the demand
- the mission time reaches the maximum allowable mission duration
- the mission is aborted.
Action space
The action space can be denoted as $\{a_0, a_1\}$, where $a_0$ and $a_1$ represent the actions to abort and to continue the mission, respectively.
Reward
$$
r_t(s_t,a_t,s_{t+1})=\begin{cases}
-c_m & a_t = a_0 \\
r_m - c_{op} & a_t=a_1,\ \psi_{t+1} \geq D \\
-c_{op}-c_m-c_f & a_t=a_1,\ \psi_{t+1}<D,\ X_{t+1}=0 \\
-c_{op}-c_m & a_t=a_1,\ \psi_{t+1}<D,\ X_{t+1}>0,\ t=T-1 \\
-c_{op} & \text{otherwise}
\end{cases}
$$
It is noted that the reward function depends not only on the current state and the selected action, but also on the system’s degraded state and mission progress at the next time step. If the system fails at the next time step, the mission failure cost and the system failure cost are incurred; if the mission succeeds at the next time step, a profit is yielded.
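
Read literally, the piecewise reward can be transcribed into a small function. This is a sketch that assumes the state is encoded as a tuple $(X_t, \psi_t, t)$ with $X = 0$ denoting the completely failed state, and actions encoded as $0$ (abort) and $1$ (continue):

```python
def mission_abort_reward(a, x_next, psi_next, t, *, T, D, r_m, c_op, c_m, c_f):
    """Piecewise reward r_t(s_t, a_t, s_{t+1}) for the mission abort MDP (sketch).

    a        : 0 = abort (a_0), 1 = continue (a_1)
    x_next   : degraded state X_{t+1}, with 0 denoting complete failure
    psi_next : cumulative performance psi_{t+1}
    t        : current time step; T is the mission duration, D the demand
    """
    if a == 0:                        # abort: only the mission failure cost
        return -c_m
    if psi_next >= D:                 # mission succeeds at the next time step
        return r_m - c_op
    if x_next == 0:                   # system fails during the mission
        return -c_op - c_m - c_f
    if t == T - 1:                    # mission time expires short of the demand
        return -c_op - c_m
    return -c_op                      # otherwise: operation cost only
```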
Case study
Gas pipeline
A mission abort example for a gas pipeline transmission system is given to exemplify the application of RL algorithms to reliability optimization problems. The pipeline system has six possible discrete states, namely, no damage (the perfectly functioning state), rupture (the completely failed state), and four flaw states of different severity levels.
The transmission capacities of the pipeline system in these states are $0, 1, 2, 3, 4$, and $5 \times 10^5~\mathrm{m^3/day}$, respectively. The mission of the pipeline system is to transmit $D = 2.5 \times 10^6~\mathrm{m^3}$ of gas within one week. The operation cost of the pipeline system is $c_{op} = 2 \times 10^4$ US dollars/day. Successful completion of the transmission mission yields a profit of $r_m = 6 \times 10^5$ US dollars; otherwise a loss of $c_m = 4 \times 10^5$ US dollars is incurred. If the pipeline fails during the mission, a loss of $c_f = 1 \times 10^5$ US dollars related to gas leakage is incurred in addition to the mission failure cost $c_m$. The degradation of the pipeline system is characterized by a homogeneous discrete-time Markov chain, and the one-day transition probability matrix is:
$$
P= \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0.2 & 0.8 & 0 & 0 & 0 & 0 \\
0.1 & 0.1 & 0.8 & 0 & 0 & 0 \\
0.02 & 0.05 & 0.08 & 0.85 & 0 & 0 \\
0 & 0.02 & 0.05 & 0.08 & 0.85 & 0 \\
0 & 0 & 0.02 & 0.03 & 0.05 & 0.9 \\
\end{bmatrix}
$$
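
For later reference, the case-study data can be collected as follows (capacities in m³/day, demand in m³, costs in US dollars; one week of daily inspections gives $T = 7$ and $\Delta t = 1$ day):

```python
import numpy as np

# Gas pipeline case-study parameters (state 0 = rupture, state 5 = no damage).
P = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.20, 0.80, 0.00, 0.00, 0.00, 0.00],
    [0.10, 0.10, 0.80, 0.00, 0.00, 0.00],
    [0.02, 0.05, 0.08, 0.85, 0.00, 0.00],
    [0.00, 0.02, 0.05, 0.08, 0.85, 0.00],
    [0.00, 0.00, 0.02, 0.03, 0.05, 0.90],
])                                        # one-day transition probability matrix
g = np.array([0, 1, 2, 3, 4, 5]) * 1e5    # transmission capacity, m^3/day
D = 2.5e6                                 # demand, m^3
T = 7                                     # mission duration, days (one week)
dt = 1                                    # inspection interval, days
c_op = 2e4                                # operation cost, $/day
r_m = 6e5                                 # mission success profit, $
c_m = 4e5                                 # mission failure loss, $
c_f = 1e5                                 # system failure (leakage) loss, $
```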
Use different algorithms (Q-learning / DQN / Value iteration) to solve this problem and compare their results.
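
As one possible starting point, here is a tabular Q-learning sketch for the abort/continue decision. It assumes the parameter definitions and the `mission_abort_reward` function from the sketches above are in scope, and the hyperparameters (learning rate, exploration rate, number of episodes) are arbitrary choices rather than tuned values from the paper; value iteration and DQN would operate on the same environment dynamics.

```python
import numpy as np
from collections import defaultdict

# Tabular Q-learning sketch; reuses P, g, D, T, dt, c_op, r_m, c_m, c_f and
# mission_abort_reward() defined in the sketches above.
rng = np.random.default_rng(42)
ABORT, CONTINUE = 0, 1

def step(x, psi, t, a):
    """One transition of the mission abort MDP: returns (next_state, reward, done)."""
    if a == ABORT:
        return None, -c_m, True
    psi_next = psi + g[x] * dt                       # performance accumulated over the interval
    x_next = int(rng.choice(len(g), p=P[x]))         # sample the next degraded state
    r = mission_abort_reward(a, x_next, psi_next, t,
                             T=T, D=D, r_m=r_m, c_op=c_op, c_m=c_m, c_f=c_f)
    done = psi_next >= D or x_next == 0 or t == T - 1
    return (x_next, psi_next, t + 1), r, done

Q = defaultdict(lambda: np.zeros(2))                 # Q[(x, psi, t)] -> value of each action
alpha, gamma, eps = 0.1, 1.0, 0.1                    # learning rate, discount, exploration rate

for episode in range(100_000):
    s, done = (5, 0.0, 0), False                     # start undamaged, no output, day 0
    while not done:
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(*s, a)
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])        # standard Q-learning update
        s = s_next

print("Q(initial state):", Q[(5, 0.0, 0)])           # greedy action = argmax of this vector
```

Because the state $(X_t, \psi_t, t)$ is discrete and small in this case study ($\psi_t$ only takes multiples of $10^5~\mathrm{m^3}$), a lookup table suffices and value iteration over the enumerated state space is also feasible; DQN becomes attractive when the cumulative performance or the state space is continuous or high-dimensional.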