6 July 2021 / REINFORCEMENT

Reinforcement Learning - Logistics AI

A Robot Working in a Warehouse

import numpy as np
import pandas as pd

gamma = 0.75 # discount factor
alpha = 0.9 # learning rate

Environment Setup

# Define the states
location_to_state = {'A' : 0, 'B' : 1, 'C' : 2, 'D' : 3, 'E' : 4, 'F' : 5, 'G' : 6,
                    'H' : 7, 'I' : 8, 'J' : 9, 'K' : 10, 'L' : 11}

location_to_state

{'A': 0,
 'B': 1,
 'C': 2,
 'D': 3,
 'E': 4,
 'F': 5,
 'G': 6,
 'H': 7,
 'I': 8,
 'J': 9,
 'K': 10,
 'L': 11}

# Define the actions
actions = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Define the rewards
R = np.array([[0,1,0,0,0,0,0,0,0,0,0,0],
              [1,0,1,0,0,1,0,0,0,0,0,0],
              [0,1,0,0,0,0,1,0,0,0,0,0],
              [0,0,0,0,0,0,0,1,0,0,0,0],
              [0,0,0,0,0,0,0,0,1,0,0,0],
              [0,1,0,0,0,0,0,0,0,1,0,0],
              [0,0,1,0,0,0,1000,1,0,0,0,0],
              [0,0,0,1,0,0,1,0,0,0,0,1],
              [0,0,0,0,1,0,0,0,0,1,0,0],
              [0,0,0,0,0,1,0,0,1,0,1,0],
              [0,0,0,0,0,0,0,0,0,1,0,1],
              [0,0,0,0,0,0,0,1,0,0,1,0]])
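
The reward matrix above doubles as an adjacency matrix: a 1 at row i, column j means the robot can move from location i to location j. As a minimal sketch, the same matrix could be generated from a list of connected location pairs instead of being typed by hand. The `edges` list (read off the matrix, assuming every link is two-way) and the `build_rewards` helper are hypothetical names, not part of the original code.

```python
import numpy as np

locations = "ABCDEFGHIJKL"
location_to_state = {loc: i for i, loc in enumerate(locations)}

# Connected pairs read off the reward matrix (assumption: every link is two-way).
edges = ["AB", "BC", "BF", "CG", "DH", "EI", "FJ", "GH", "HL", "IJ", "JK", "KL"]

def build_rewards(edges, goal, goal_reward=1000):
    """Return a 12x12 matrix with 1 for each allowed move and goal_reward on the goal's diagonal cell."""
    R = np.zeros((12, 12), dtype=int)
    for a, b in edges:
        i, j = location_to_state[a], location_to_state[b]
        R[i, j] = 1  # move a -> b is allowed
        R[j, i] = 1  # move b -> a is allowed
    g = location_to_state[goal]
    R[g, g] = goal_reward  # staying at the goal earns the large reward
    return R

R = build_rewards(edges, goal="G")
print(R[location_to_state["G"]])  # row for location G
```

Building the matrix this way keeps it symmetric by construction, which makes typos in individual rows less likely.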

Q-Learning AI Solution

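# Initialize the Q-values to zero
Q = np.zeros([12, 12])
for i in range(1000):
    # Pick a random current state
    current_state = np.random.randint(0, 12)
    # Collect the actions playable from this state (those with a positive reward)
    playable_actions = []
    for j in range(12):
        if R[current_state, j] > 0:
            playable_actions.append(j)
    # Play a random action; the action index is also the next state
    next_state = np.random.choice(playable_actions)
    # Compute the temporal difference, then update the Q-value
    TD = R[current_state, next_state] + gamma * Q[next_state, np.argmax(Q[next_state, ])] - Q[current_state, next_state]
    Q[current_state, next_state] = Q[current_state, next_state] + alpha * TD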
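
To make the update rule concrete, the sketch below applies it by hand to the self-transition at G (state 6), the only cell with a reward of 1000. The helper name `td_update` is an assumption for illustration; the arithmetic follows the temporal difference and Q-value update used in the loop above.

```python
import numpy as np

gamma = 0.75  # discount factor
alpha = 0.9   # learning rate

def td_update(Q, reward, state, action):
    """Apply one Q-learning update in place and return the temporal difference."""
    # Moving to a location means the next state equals the chosen action.
    next_state = action
    td = reward + gamma * Q[next_state].max() - Q[state, action]
    Q[state, action] += alpha * td
    return td

Q = np.zeros((12, 12))

# First visit to (G, G): TD = 1000 + 0.75 * 0 - 0 = 1000, so Q[6, 6] = 0.9 * 1000 = 900.
td_update(Q, reward=1000, state=6, action=6)
print(Q[6, 6])  # 900.0

# Repeating the update drives Q[6, 6] toward 1000 / (1 - 0.75) = 4000.
for _ in range(100):
    td_update(Q, reward=1000, state=6, action=6)
print(round(Q[6, 6]))  # 4000
```

This limit of 4000 is consistent with the 3988 reported for cell (6, 6) in the Q-value table, where training stopped after 1000 random iterations.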

Q-value

pd.DataFrame(Q).astype(int)

	0	1	2	3	4	5	6	7	8	9	10	11
0	0	1684	0	0	0	0	0	0	0	0	0	0
1	1260	0	2244	0	0	1246	0	0	0	0	0	0
2	0	1684	0	0	0	0	2992	0	0	0	0	0
3	0	0	0	0	0	0	0	2243	0	0	0	0
4	0	0	0	0	0	0	0	0	707	0	0	0
5	0	1679	0	0	0	0	0	0	0	945	0	0
6	0	0	2242	0	0	0	3988	2215	0	0	0	0
7	0	0	0	1672	0	0	2991	0	0	0	0	1677
8	0	0	0	0	531	0	0	0	0	945	0	0
9	0	0	0	0	0	1260	0	0	707	0	1255	0
10	0	0	0	0	0	0	0	0	0	944	0	1678
11	0	0	0	0	0	0	0	2236	0	0	1258	0
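
Each row of the table can be read as a decision rule: from a given state, the robot moves to the column holding the largest Q-value. As a small sketch using the integer values printed above (copied into a hypothetical `Q_table` array), the greedy next location for every state can be listed as follows.

```python
import numpy as np

locations = "ABCDEFGHIJKL"

# Integer Q-values copied from the table above.
Q_table = np.array([
    [0, 1684, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1260, 0, 2244, 0, 0, 1246, 0, 0, 0, 0, 0, 0],
    [0, 1684, 0, 0, 0, 0, 2992, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 2243, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 707, 0, 0, 0],
    [0, 1679, 0, 0, 0, 0, 0, 0, 0, 945, 0, 0],
    [0, 0, 2242, 0, 0, 0, 3988, 2215, 0, 0, 0, 0],
    [0, 0, 0, 1672, 0, 0, 2991, 0, 0, 0, 0, 1677],
    [0, 0, 0, 0, 531, 0, 0, 0, 0, 945, 0, 0],
    [0, 0, 0, 0, 0, 1260, 0, 0, 707, 0, 1255, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 944, 0, 1678],
    [0, 0, 0, 0, 0, 0, 0, 2236, 0, 0, 1258, 0],
])

# For each row (current state), pick the column (next state) with the largest Q-value.
greedy_next = {
    locations[s]: locations[int(np.argmax(Q_table[s]))] for s in range(12)
}
print(greedy_next)
```

Note that G maps to itself: the 1000 reward sits on the diagonal, so staying at G is the best action once the robot arrives.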

Getting Started

state_to_location = {state: location for location, state in location_to_state.items()}

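The AI starts at the starting location E.
The AI gets the state corresponding to location E. According to the location_to_state mapping, $s_{0} = 4$.
In the row of the Q-value matrix with index $s_{0} = 4$, the AI picks the column with the largest Q-value (707).
That column has index 8, so the AI performs the action with index 8, which makes the next state $s_{1} = 8$.
The AI gets the location of state 8, which is I according to the state_to_location mapping. Since the next location is I, I is appended to the AI's list holding the optimal route.
Starting from the new location I, the previous steps are repeated until the final destination G is reached.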

def route(starting_location, ending_location):
    route = [starting_location]
    next_location = starting_location
    while (next_location != ending_location):
        starting_state = location_to_state[starting_location]
        next_state = np.argmax(Q[starting_state, ])
        next_location = state_to_location[next_state]
        route.append(next_location)
        starting_location = next_location
    return route

route('E', 'G')

['E', 'I', 'J', 'F', 'B', 'C', 'G']
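
The while loop in route() only stops when the destination is reached, so a Q matrix that has not converged, or a destination the greedy choices never reach, would make it loop forever. A hedged variant is sketched below. The `safe_route` function and its `max_steps` parameter are hypothetical additions, and the tiny 3-state `Q_demo` matrix is made up purely to exercise the function.

```python
import numpy as np

def safe_route(Q, location_to_state, starting_location, ending_location, max_steps=50):
    """Follow the greedy choices in Q, raising an error instead of looping forever."""
    state_to_location = {s: loc for loc, s in location_to_state.items()}
    path = [starting_location]
    current = starting_location
    while current != ending_location:
        # len(path) - 1 moves have been made so far; give up once the budget is spent.
        if len(path) > max_steps:
            raise RuntimeError(f"no route found within {max_steps} steps: {path}")
        next_state = int(np.argmax(Q[location_to_state[current]]))
        current = state_to_location[next_state]
        path.append(current)
    return path

# Illustrative 3-location chain X -> Y -> Z (made-up values, not from the post).
demo_map = {"X": 0, "Y": 1, "Z": 2}
Q_demo = np.array([[0, 5, 0],
                   [1, 0, 9],
                   [0, 2, 10]])
print(safe_route(Q_demo, demo_map, "X", "Z"))  # ['X', 'Y', 'Z']
```

Asking this demo for a route from Z to X raises RuntimeError, because the greedy choice at Z is to stay at Z.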

Improvements

Automating Reward Attribution

def route(starting_location, ending_location):
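    # Copy the reward matrix and clear any goal reward left on the diagonal
    # (R hard-codes 1000 at G), then set the reward of the ending_location cell to 1000.
    R_new = np.copy(R)
    np.fill_diagonal(R_new, 0)
    ending_state = location_to_state[ending_location]
    R_new[ending_state, ending_state] = 1000

    # With the reward updated in the copy, the whole Q-learning process has to run inside the function.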
    Q = np.array(np.zeros([12, 12]))
    for i in range(1000):
        current_state = np.random.randint(0, 12)
        playable_actions = []
        for j in range(12):
            if R_new[current_state, j] > 0:
                playable_actions.append(j)
        next_state = np.random.choice(playable_actions)
        TD = R_new[current_state, next_state] + gamma * Q[next_state, np.argmax(Q[next_state, ])] - Q[current_state, next_state]
        Q[current_state, next_state] = Q[current_state, next_state] + alpha * TD
    route = [starting_location]
    next_location = starting_location
    while (next_location != ending_location):
        starting_state = location_to_state[starting_location]
        next_state = np.argmax(Q[starting_state, ])
        next_location = state_to_location[next_state]
        route.append(next_location)
        starting_location = next_location
    return route

Adding an Intermediate Goal

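We create an additional best_route() function that takes three inputs: the starting, intermediate, and ending locations. It calls the route() function built earlier twice: once to go from the starting location to the intermediate location, and once to go from the intermediate location to the ending location.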

def best_route(starting_location, intermediary_state, ending_location):
    return route(starting_location, intermediary_state) + route(intermediary_state, ending_location)[1:]

best_route('E','K','G')

['E', 'I', 'J', 'K', 'L', 'H', 'G']
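
The [1:] slice in best_route() matters because both calls to route() include the intermediate location: the first route ends with it and the second starts with it. As a small illustration, the two segments implied by the output above are hard-coded here (an assumption, rather than recomputed by training).

```python
# Segments implied by the best_route('E', 'K', 'G') output above.
first_leg = ['E', 'I', 'J', 'K']   # what route('E', 'K') would return
second_leg = ['K', 'L', 'H', 'G']  # what route('K', 'G') would return

# Without the slice, the intermediate location 'K' appears twice.
print(first_leg + second_leg)      # ['E', 'I', 'J', 'K', 'K', 'L', 'H', 'G']

# Dropping the first element of the second leg removes the duplicate.
combined = first_leg + second_leg[1:]
print(combined)                    # ['E', 'I', 'J', 'K', 'L', 'H', 'G']
```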

Reference: Hadelin de Ponteves (아들랑 드 폰테베), 『강화학습/심층강화학습특강』, Wikibooks (2021), pp. 86-107
