The complete code is here. TensorFlow > 1.4 is required.
### 1. Framework Diagram
To search for better neural network architectures, Google proposed <Neural Architecture Search with Reinforcement Learning>: the policy gradient algorithm from reinforcement learning is used to pick better network structures out of a search space. That search space can cover the number of layers, the type of activation function, the dropout rate, the kernel size in a CNN, and so on; all of these can be regarded as hyperparameters of the neural network.
<The First Step-by-Step Guide for Implementing Neural Architecture Search with Reinforcement Learning Using TensorFlow> gives an example of how reinforcement learning can find, for a CNN, the optimal combination of output dimension, 1-D convolution kernel size, pool size, and per-layer dropout rate, and thereby a better network structure.
The search framework is shown in the figure below.

The reinforcement learning part is trained with the policy gradient method. It produces actions that modify the structure of the CNN, i.e. the parameters the CNN model should use. The CNN model is then trained with this set of parameters, and its accuracy is returned as the reward. Here, the CNN's task is to recognize the digits of MNIST.
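Before diving into the code, here is a minimal sketch of one search iteration, written with the names (`reinforce`, `net_manager`, `get_action`, `get_reward`, `store_rollout`, `train_step`) that the rest of this post builds:
```python
# One search iteration (mirrors the training loop in section 4):
action = reinforce.get_action(state)          # controller proposes CNN hyperparameters
reward, pre_acc = net_manager.get_reward(     # build and train a CNN with them;
    action, step, pre_acc)                    # its accuracy becomes the reward
state = action[0]                             # in this example the action is the next state
reinforce.store_rollout(state, reward)        # remember the (state, reward) pair
ls = reinforce.train_step(MAX_STEPS)          # policy-gradient update of the controller
                                              # (MAX_STEPS as used in section 4)
```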
### 2. The Policy Network
We first build the policy network. It takes the current state of the CNN (here the state is identical to the action) and the maximum number of layers as input, and outputs an action that updates the CNN model.
```python
def policy_network(state, max_layers):
    with tf.name_scope("policy_network"):
        # one NASCell with 4*max_layers units: 4 hyperparameters per layer
        nas_cell = tf.contrib.rnn.NASCell(4*max_layers)
        outputs, state = tf.nn.dynamic_rnn(
            nas_cell,
            tf.expand_dims(state, -1),
            dtype=tf.float32
        )
        bias = tf.Variable([0.05]*4*max_layers)
        outputs = tf.nn.bias_add(outputs, bias)
        print("outputs: ", outputs, outputs[:, -1:, :],
              tf.slice(outputs, [0, 4*max_layers-1, 0], [1, 1, 4*max_layers]))
        # return only the last time step: one score per hyperparameter
        return outputs[:, -1:, :]
```
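A quick shape check can help here (a hypothetical snippet, not part of the original code): with `max_layers = 2` the state has `4*max_layers = 8` entries, and the returned tensor has shape `[batch, 1, 8]`, i.e. one score per hyperparameter.
```python
import numpy as np
import tensorflow as tf

max_layers = 2
states = tf.placeholder(tf.float32, [None, 4*max_layers], name="states")
logits = policy_network(states, max_layers)        # shape: [None, 1, 4*max_layers]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = np.array([[10.0, 128.0, 1.0, 1.0]*max_layers], dtype=np.float32)
    print(sess.run(logits, {states: state}).shape)  # (1, 1, 8)
```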
Next we define the `Reinforce` class, which performs the parameter updates.
```python
class Reinforce():
    def __init__(self, sess, optimizer, policy_network, max_layers, global_step,
                 division_rate=100.0,
                 reg_param=0.001,
                 discount_factor=0.99,
                 exploration=0.3):
        self.sess = sess
        self.optimizer = optimizer
        self.policy_network = policy_network
        self.division_rate = division_rate
        self.reg_param = reg_param
        self.discount_factor = discount_factor
        self.max_layers = max_layers
        self.global_step = global_step

        self.reward_buffer = []
        self.state_buffer = []
```
Some of the parameters:
- `division_rate`: the normal-distribution value of every neuron, from -1.0 to 1.0.
- `reg_param`: regularization parameter.
- `exploration`: the probability of producing a random action (exploration vs. exploitation); see the sketch below.
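The training loop in section 4 calls `reinforce.get_action(state)`, which is not listed in this post. A minimal sketch of what it could look like, assuming the constructor also stores `exploration` as `self.exploration` (the value range below is purely illustrative):
```python
import random
import numpy as np

# (sketch) a method of the Reinforce class
def get_action(self, state):
    # with probability self.exploration, take a random action (exploration) ...
    if random.random() < self.exploration:
        # illustrative bounds only; real bounds depend on each hyperparameter
        return np.array([[np.random.randint(1, 450, size=4*self.max_layers)]])
    # ... otherwise query the policy network (exploitation)
    return self.sess.run(self.predicted_action, {self.states: state})
```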
In `create_variables`, the next action is derived from the output of `policy_network`.
```python
def create_variables(self):
    with tf.name_scope("model_inputs"):
        # raw state representation
        self.states = tf.placeholder(tf.float32, [None, self.max_layers*4], name="states")

    with tf.name_scope("predict_actions"):
        # initialize policy network
        with tf.variable_scope("policy_network"):
            self.policy_outputs = self.policy_network(self.states, self.max_layers)

        self.action_scores = tf.identity(self.policy_outputs, name="action_scores")
        self.predicted_action = tf.cast(tf.scalar_mul(self.division_rate, self.action_scores),
                                        tf.int32, name="predicted_action")

    # regularization loss
    policy_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                                 scope="policy_network")

    # compute loss and gradients
    with tf.name_scope("compute_gradients"):
        # gradients for selecting action from policy network
        self.discounted_rewards = tf.placeholder(tf.float32, (None,), name="discounted_rewards")

        with tf.variable_scope("policy_network", reuse=True):
            self.logprobs = self.policy_network(self.states, self.max_layers)

        # compute policy loss and regularization loss
        self.cross_entropy_loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logprobs,
                                                                          labels=self.states)
        self.pg_loss = tf.reduce_mean(self.cross_entropy_loss)
        self.reg_loss = tf.reduce_sum([tf.reduce_sum(tf.square(x)) for x in policy_network_variables])
        self.loss = self.pg_loss + self.reg_param * self.reg_loss

        # compute gradients
        self.gradients = self.optimizer.compute_gradients(self.loss)

        # compute policy gradients
        for i, (grad, var) in enumerate(self.gradients):
            if grad is not None:
                self.gradients[i] = (grad * self.discounted_rewards, var)

    # training update
    with tf.name_scope("train_policy_network"):
        # apply gradients to update policy network
        self.train_op = self.optimizer.apply_gradients(self.gradients)
```
In general, the policy gradient is computed as
```math
\nabla_{\theta}J(\theta)=E_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(s,a)\,Q^{\pi_{\theta}}(s,a)\right]
```
We would therefore need to compute $\nabla_{\theta}\log\pi_{\theta}(s,a)$ and $Q^{\pi_{\theta}}(s,a)$ separately. Here, however, the gradients are computed directly from the loss:
```python
self.gradients = self.optimizer.compute_gradients(self.loss)
```
The loss consists of the cross-entropy loss and the regularization term. Since the softmax cross-entropy between the policy output and the chosen action (here, the state) is essentially $-\log\pi_{\theta}(s,a)$, multiplying each gradient by the discounted reward (the `grad * self.discounted_rewards` step in `create_variables`) yields the REINFORCE estimate of the policy gradient, with the discounted reward standing in for $Q^{\pi_{\theta}}(s,a)$.
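`store_rollout` and `train_step`, which the training loop in section 4 relies on, are not listed in this post either. A minimal sketch, assuming `train_step(steps_count)` replays the last `steps_count` rollouts and feeds their rewards into the `discounted_rewards` placeholder:
```python
import numpy as np

# (sketch) methods of the Reinforce class
def store_rollout(self, state, reward):
    # remember the sampled architecture and the reward it obtained
    self.reward_buffer.append(reward)
    self.state_buffer.append(state[0])

def train_step(self, steps_count):
    # replay the last `steps_count` rollouts
    states = np.array(self.state_buffer[-steps_count:])
    rewards = self.reward_buffer[-steps_count:]
    # one gradient step: the reward scales the cross-entropy gradient (REINFORCE)
    _, ls = self.sess.run([self.train_op, self.loss],
                          {self.states: states,
                           self.discounted_rewards: rewards})
    return ls
```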
### 3. Training the CNN Model
In this way, every action that is produced generates a new CNN model. Since there will be many such CNNs, it makes sense to have a `NetManager` class to manage them.
```python
class NetManager():
    def __init__(self, num_input, num_classes, learning_rate, mnist,
                 max_step_per_action=5500,
                 bathc_size=100,
                 dropout_rate=0.85):
        self.num_input = num_input
        self.num_classes = num_classes
        self.learning_rate = learning_rate
        self.mnist = mnist
        self.max_step_per_action = max_step_per_action
        self.bathc_size = bathc_size
        self.dropout_rate = dropout_rate  # dropout after the dense layer of the CNN
```
Next, a CNN model is generated from the action; the model is trained and its reward is obtained.
```python
def get_reward(self, action, step, pre_acc):
    # split the flat action into one [filters, kernel_size, pool_size, dropout] 4-tuple per layer
    action = [action[0][0][x:x+4] for x in range(0, len(action[0][0]), 4)]
    # the dropout rate of every layer is the 4th entry of its 4-tuple
    cnn_drop_rate = [c[3] for c in action]

    # create a new CNN with the new architecture
    with tf.Graph().as_default() as g:
        with g.container('experiment'+str(step)):
            model = CNN(self.num_input, self.num_classes, action)
            loss_op = tf.reduce_mean(model.loss)
            optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
            train_op = optimizer.minimize(loss_op)
```
An action such as `[[10.0, 128.0, 1.0, 1.0]*args.max_layers]` specifies, for every CNN layer, the number of 1-D convolution filters, the `kernel_size`, the `pool_size`, and the rate of the dropout layer.
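The `CNN` class itself is not listed in this post. Purely to illustrate how one 4-tuple of the action could be turned into a layer, here is a hypothetical sketch using `tf.layers` (the real class may interpret and rescale the values differently):
```python
def build_layer(x, filters, kernel_size, pool_size, drop_rate, is_training):
    # 1-D convolution with the proposed number of filters and kernel size
    x = tf.layers.conv1d(x, filters=int(filters), kernel_size=int(kernel_size),
                         padding="same", activation=tf.nn.relu)
    # pooling with the proposed pool size
    x = tf.layers.max_pooling1d(x, pool_size=int(pool_size), strides=1, padding="same")
    # dropout with the proposed rate (the 4th entry of the 4-tuple)
    return tf.layers.dropout(x, rate=drop_rate, training=is_training)
```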
### 4. Training
```python
import datetime

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data


def train(mnist, max_layers):
    sess = tf.Session()
    global_step = tf.Variable(0, trainable=False)
    starter_learning_rate = 0.1
    learning_rate = tf.train.exponential_decay(0.99, global_step,
                                               500, 0.96, staircase=True)
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate)

    reinforce = Reinforce(sess, optimizer, policy_network, max_layers, global_step)
    net_manager = NetManager(num_input=784,
                             num_classes=10,
                             learning_rate=0.001,
                             mnist=mnist)

    MAX_EPISODES = 250
    step = 0
    state = np.array([[10.0, 128.0, 1.0, 1.0]*max_layers], dtype=np.float32)
    pre_acc = 0.0
    for i_episode in range(MAX_EPISODES):
        action = reinforce.get_action(state)
        print("current action:", action)
        if all(ai > 0 for ai in action[0][0]):
            reward, pre_acc = net_manager.get_reward(action, step, pre_acc)
        else:
            reward = -1.0
        # in our sample the action is equal to the state
        state = action[0]
        reinforce.store_rollout(state, reward)

        step += 1
        ls = reinforce.train_step(MAX_STEPS)  # MAX_STEPS is defined in the full code
        log_str = "current time: " + str(datetime.datetime.now().time()) + \
                  " episode: " + str(i_episode) + " loss: " + str(ls) + \
                  " last_state: " + str(state) + " last_reward: " + str(reward)
        print(log_str)


def main():
    max_layers = 3
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    train(mnist, max_layers)


if __name__ == '__main__':
    main()
```