Implementing Neural Architecture Search (NAS) Step by Step

The complete code is here. TensorFlow > 1.4 is required.

### 1. The Framework Diagram

To search for a better neural network architecture, Google proposed <Neural Architecture Search with Reinforcement Learning>: the policy gradient algorithm from reinforcement learning is used to select better network architectures from a search space. The search space can include the number of layers, the type of activation function, the dropout rate, the kernel size in a CNN, and so on; all of these can be regarded as hyperparameters of the neural network.

<The First Step-by-Step Guide for Implementing Neural Architecture Search with Reinforcement Learning Using TensorFlow> gives an example of how reinforcement learning can find the optimal combination of a CNN's output dimension, the kernel size of the 1-D convolution, the pool size, and the dropout rate of each layer, producing a better network architecture.
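To make this concrete, here is a minimal, self-contained sketch (the variable names are illustrative, not taken from the code below) of the flat state vector used throughout this post: four numbers per layer, in the order just described.

```python
import numpy as np

# Hypothetical illustration: four hyperparameters per layer, in the order
# described above: [number of filters, kernel_size, pool_size, dropout rate]
max_layers = 2
state = np.array([[10.0, 128.0, 1.0, 1.0] * max_layers], dtype=np.float32)

# Split the flat vector back into one group of four values per layer
layers = [state[0][x:x + 4] for x in range(0, len(state[0]), 4)]
for i, (filters, kernel_size, pool_size, dropout) in enumerate(layers):
    print("layer %d: filters=%d, kernel_size=%d, pool_size=%d, dropout=%.2f"
          % (i + 1, filters, kernel_size, pool_size, dropout))
```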

The search framework is shown in the figure below.

(Figure: the search framework)

The reinforcement learning part is trained with the policy gradient method and produces actions that modify the CNN's structure, i.e., the parameters the CNN model should use. The CNN model is then trained with this set of parameters, and the resulting accuracy is used as the reward. Here the CNN's task is to recognize the digits in MNIST.

### 2. The Policy Network

We first build the policy network. It takes the CNN's current state (here the state is the same as the action) and the maximum number of layers as inputs, and outputs an action that updates the CNN model.

```python
def policy_network(state, max_layers):
    with tf.name_scope("policy_network"):
        nas_cell = tf.contrib.rnn.NASCell(4*max_layers)
        outputs, state = tf.nn.dynamic_rnn(
            nas_cell,
            tf.expand_dims(state, -1),
            dtype=tf.float32
        )
        bias = tf.Variable([0.05]*4*max_layers)
        outputs = tf.nn.bias_add(outputs, bias)
        print("outputs: ", outputs, outputs[:, -1:, :],
              tf.slice(outputs, [0, 4*max_layers-1, 0], [1, 1, 4*max_layers]))
        return outputs[:, -1:, :]
```
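As a quick sanity check (not part of the original post), the snippet below builds the policy network on a placeholder and prints the shape of its output: a row of 4*max_layers scores, one per hyperparameter. It assumes TensorFlow 1.x with tf.contrib available, as in the rest of the post.

```python
import numpy as np
import tensorflow as tf

max_layers = 2
states = tf.placeholder(tf.float32, [None, 4 * max_layers], name="states")
outputs = policy_network(states, max_layers)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = np.array([[10.0, 128.0, 1.0, 1.0] * max_layers], dtype=np.float32)
    # One score per hyperparameter: (1, 1, 4*max_layers)
    print(sess.run(outputs, {states: state}).shape)
```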

Next, define a Reinforce class to adjust the parameters.

```python
class Reinforce():
    def __init__(self, sess, optimizer, policy_network, max_layers, global_step,
                 division_rate=100.0,
                 reg_param=0.001,
                 discount_factor=0.99,
                 exploration=0.3):
        self.sess = sess
        self.optimizer = optimizer
        self.policy_network = policy_network
        self.division_rate = division_rate
        self.reg_param = reg_param
        self.discount_factor = discount_factor
        self.max_layers = max_layers
        self.global_step = global_step

        self.reward_buffer = []
        self.state_buffer = []
```
Some of the parameters:
- division_rate: the normal-distribution value of each neuron, roughly from -1.0 to 1.0; the policy outputs are multiplied by it to obtain integer hyperparameters (see predicted_action below).
- reg_param: the regularization coefficient.
- exploration: the probability of producing a random action (exploration vs. exploitation); see the sketch after this list.
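The training loop in section 4 calls reinforce.get_action(state), but that method is not listed in this post. Purely as an illustration of how the exploration probability could be used, here is a hedged sketch: it assumes `import random` and `import numpy as np` at module level, that `__init__` also stores `self.exploration = exploration`, and that `self.states` and `self.predicted_action` are the ops created in create_variables below.

```python
    # Hedged sketch of get_action (not shown in the original post).
    # With probability `exploration` a random architecture is sampled;
    # otherwise the policy network's predicted_action op is evaluated.
    def get_action(self, state):
        if random.random() < self.exploration:
            # random integer hyperparameters with the same shape as predicted_action
            return np.array([[random.sample(range(1, 35), 4 * self.max_layers)]])
        return self.sess.run(self.predicted_action, {self.states: state})
```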

In create_variables, the next action is derived from the output of policy_network.
```python
    def create_variables(self):
        with tf.name_scope("model_inputs"):
            # raw state representation
            self.states = tf.placeholder(tf.float32, [None, self.max_layers*4], name="states")

        with tf.name_scope("predict_actions"):
            # initialize policy network
            with tf.variable_scope("policy_network"):
                self.policy_outputs = self.policy_network(self.states, self.max_layers)

            self.action_scores = tf.identity(self.policy_outputs, name="action_scores")

            self.predicted_action = tf.cast(tf.scalar_mul(self.division_rate, self.action_scores), tf.int32, name="predicted_action")


        # regularization loss
        policy_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy_network")

        # compute loss and gradients
        with tf.name_scope("compute_gradients"):
            # gradients for selecting action from policy network
            self.discounted_rewards = tf.placeholder(tf.float32, (None,), name="discounted_rewards")

            with tf.variable_scope("policy_network", reuse=True):
                self.logprobs = self.policy_network(self.states, self.max_layers)

            # compute policy loss and regularization loss
            self.cross_entropy_loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logprobs, labels=self.states)
            self.pg_loss            = tf.reduce_mean(self.cross_entropy_loss)
            self.reg_loss           = tf.reduce_sum([tf.reduce_sum(tf.square(x)) for x in policy_network_variables])
            self.loss               = self.pg_loss + self.reg_param * self.reg_loss

            #compute gradients
            self.gradients = self.optimizer.compute_gradients(self.loss)

            # compute policy gradients
            for i, (grad, var) in enumerate(self.gradients):
                if grad is not None:
                    self.gradients[i] = (grad * self.discounted_rewards, var)

            # training update
            with tf.name_scope("train_policy_network"):
                # apply gradients to update policy network
                self.train_op = self.optimizer.apply_gradients(self.gradients)

```

In general, the policy gradient is computed as

```math
\nabla_{\theta}J(\theta)=E_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(s,a)\,Q^{\pi_{\theta}}(s,a)\right]
```

which requires computing $\nabla_{\theta}\log\pi_{\theta}(s,a)$ and $Q^{\pi_{\theta}}(s,a)$ separately. Here, however, the gradients are computed directly from the loss:

```python
self.gradients = self.optimizer.compute_gradients(self.loss)
```

The loss is the sum of the cross-entropy loss and the regularization term, and each gradient is then scaled by the discounted rewards before being applied.
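Two more methods of Reinforce are used by the training loop in section 4 but are not listed above: store_rollout and train_step. The following is a plausible sketch, consistent with the buffers created in `__init__` and the placeholders defined in create_variables (it assumes `import numpy as np`); treat it as an illustration rather than the exact implementation.

```python
    # Hedged sketch: buffer the latest rollout, then train on the most
    # recent `steps_count` states/rewards through the ops built above.
    def store_rollout(self, state, reward):
        self.reward_buffer.append(reward)
        self.state_buffer.append(state[0])

    def train_step(self, steps_count):
        states = np.array(self.state_buffer[-steps_count:]) / self.division_rate
        rewards = self.reward_buffer[-steps_count:]
        _, ls = self.sess.run([self.train_op, self.loss],
                              {self.states: states,
                               self.discounted_rewards: rewards})
        return ls
```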

### 3. Training the CNN Model

Each action produces a new CNN model. Since many CNNs will be generated, it is worth writing a NetManager class to manage them.

```python
class NetManager():
    def __init__(self, num_input, num_classes, learning_rate, mnist,
                 max_step_per_action=5500,
                 batch_size=100,
                 dropout_rate=0.85):

        self.num_input = num_input
        self.num_classes = num_classes
        self.learning_rate = learning_rate
        self.mnist = mnist

        self.max_step_per_action = max_step_per_action
        self.batch_size = batch_size
        self.dropout_rate = dropout_rate  # dropout after the dense layer in the CNN
```

Next, the CNN model is generated from the action; the model is trained and the resulting accuracy is returned as the reward.

```python
    def get_reward(self, action, step, pre_acc):
        action = [action[0][0][x:x+4] for x in range(0, len(action[0][0]), 4)]
        cnn_drop_rate = [c[3] for c in action]
```

This groups the hyperparameters in `action` into one batch of four values per layer and builds `cnn_drop_rate`, the list of dropout rates for every layer. An action of [[10.0, 128.0, 1.0, 1.0]*args.max_layers] specifies, for every CNN layer, the number of 1-D convolution filters, the kernel_size, the pool_size, and the dropout rate. Now a new CNN with the new architecture is created:

```python
        with tf.Graph().as_default() as g:
            with g.container('experiment'+str(step)):
                model = CNN(self.num_input, self.num_classes, action)
                loss_op = tf.reduce_mean(model.loss)
                optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
                train_op = optimizer.minimize(loss_op)
```
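The body of get_reward continues with the actual training of this CNN on MNIST; that part is not shown above. Below is a hedged sketch of how it might continue, assuming the CNN class (defined elsewhere in the repository) exposes X, Y, dropout_keep_prob, cnn_dropout_rates and accuracy tensors; those attribute names are assumptions, not quoted from the post.

```python
                # Hedged sketch: train the freshly built CNN for a fixed number
                # of steps, then use its test accuracy as the reward (the CNN
                # attribute names here are assumptions).
                with tf.Session(graph=g) as train_sess:
                    train_sess.run(tf.global_variables_initializer())
                    for _ in range(self.max_step_per_action):
                        batch_x, batch_y = self.mnist.train.next_batch(self.batch_size)
                        feed = {model.X: batch_x,
                                model.Y: batch_y,
                                model.dropout_keep_prob: self.dropout_rate,
                                model.cnn_dropout_rates: cnn_drop_rate}
                        train_sess.run(train_op, feed_dict=feed)

                    # evaluate on held-out data; the accuracy becomes the reward
                    # (pre_acc, the previous accuracy, could be used to shape it)
                    batch_x, batch_y = self.mnist.test.next_batch(10000)
                    acc = train_sess.run(model.accuracy,
                                         feed_dict={model.X: batch_x,
                                                    model.Y: batch_y,
                                                    model.dropout_keep_prob: 1.0,
                                                    model.cnn_dropout_rates: [1.0] * len(cnn_drop_rate)})
                    return acc, acc
```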
### 4. Training
```python
import datetime

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data


def train(mnist, max_layers):
    sess = tf.Session()
    global_step = tf.Variable(0, trainable=False)
    # learning rate for the policy optimizer, decayed every 500 steps
    learning_rate = tf.train.exponential_decay(0.99, global_step,
                                               500, 0.96, staircase=True)

    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate)

    reinforce = Reinforce(sess, optimizer, policy_network, max_layers, global_step)

    net_manager = NetManager(num_input=784,
                             num_classes=10,
                             learning_rate=0.001,
                             mnist=mnist)

    MAX_EPISODES = 250
    step = 0
    state = np.array([[10.0, 128.0, 1.0, 1.0]*max_layers], dtype=np.float32)
    pre_acc = 0.0
    for i_episode in range(MAX_EPISODES):
        action = reinforce.get_action(state)
        print("current action:", action)
        if all(ai > 0 for ai in action[0][0]):
            reward, pre_acc = net_manager.get_reward(action, step, pre_acc)
        else:
            reward = -1.0

        # In our example the action is equal to the state
        state = action[0]
        reinforce.store_rollout(state, reward)

        step += 1
        ls = reinforce.train_step(1)  # one policy update per episode
        log_str = "current time:  "+str(datetime.datetime.now().time())+" episode:  "+str(i_episode)+" loss:  "+str(ls)+" last_state:  "+str(state)+" last_reward:  "+str(reward)
        print(log_str)


def main():
    max_layers = 3
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    train(mnist, max_layers)

if __name__ == '__main__':
    main()
```