At its core, this match was a victory for the combination of deep convolutional networks (CNNs) and Monte Carlo tree search (MCTS), which is itself an advance of human intelligence. Yet many so-called "experts" have begun preaching that machines will defeat humanity, or even that humans will one day be ruled by machines; such ignorant claims are hard to watch.
As Go fans and AI practitioners, we feel it is time to explain AlphaGo's principles and weaknesses.
We can say quite responsibly that AlphaGo has not completely solved the game of Go, that professional players are by no means without hope of beating it, and certainly that machines have not "defeated humans." AlphaGo still has a long way to go.
If any Chinese professional player wants to challenge AlphaGo, we are willing to assemble a team of top AI experts (who also understand Go) to advise him and help him win.
Although there are many technical posts online, not one article has fully explained the principle of AlphaGo, and even the Nature paper lacks a single diagram that captures the whole picture (and, being in English, it is hard for many students to digest thoroughly).
Below is a diagram we produced after reading the original paper many times and gathering a good deal of supplementary material. It explains how AlphaGo works; once you have read it, you will naturally see where its weaknesses lie.
Schematic of AlphaGo
AlphaGo consists of two broad processes: offline learning (the top half of the figure) and online play (the bottom half).
The offline learning process is divided into three training stages.
The first stage: train two networks on more than 30,000 game records of professional players.
One is the policy network, a deep convolutional network (CNN) over global board features. Given the current board position as input, it outputs the probability of the next move being played at each empty point on the board.
The other is the rollout policy, a linear model trained on local features.
The policy network is slow but accurate; the rollout policy is the opposite, fast but less accurate.
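The speed/accuracy trade-off is easiest to see from the rollout side: a linear model over a handful of local features costs almost nothing per move. A minimal sketch of such a rollout-style policy as a linear softmax (the feature values and dimensions below are illustrative stand-ins, not AlphaGo's actual features):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rollout_policy(local_features, weights):
    """Linear model over hand-crafted local features -> move probabilities.
    Fast but less accurate than a deep policy network over the full board."""
    logits = local_features @ weights      # one logit per candidate move
    return softmax(logits)

# toy example: 5 candidate moves, 3 local features each (random placeholders)
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 3))
w = rng.normal(size=3)
probs = rollout_policy(feats, w)           # a proper distribution over 5 moves
```

The deep policy network plays the same role (board in, move distribution out) but with many convolutional layers over the whole 19x19 board, which is why it is orders of magnitude slower per evaluation.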
The second stage: let the current policy network play against previously trained versions of itself, and use reinforcement learning to update its parameters from the outcomes, eventually obtaining a strengthened policy network.
This is the part most hyped by the "experts," but it likely faces a theoretical ceiling (limited room for improvement).
It is as if two six-year-olds played chess against each other endlessly: would their level really reach that of a 9-dan professional?
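Skepticism aside, the reinforcement-learning step itself is a standard policy-gradient (REINFORCE) update: nudge the policy toward the moves played in won games and away from moves played in lost games. A minimal sketch with a linear-softmax policy (the features, actions, and learning rate here are illustrative, not AlphaGo's):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(w, episode, z, lr=0.01):
    """One REINFORCE step over a finished self-play game.
    episode: list of (candidate_features, chosen_action); z = +1 win, -1 loss.
    Each step moves the policy toward actions that led to a win."""
    for feats, a in episode:
        p = softmax(feats @ w)
        grad_log_p = feats[a] - p @ feats   # d log p[a] / dw for linear softmax
        w = w + lr * z * grad_log_p
    return w

# toy demo: reinforcing action 2 after a "win" should raise its probability
rng = np.random.default_rng(1)
feats = rng.normal(size=(4, 3))             # 4 candidate moves, 3 features each
w0 = np.zeros(3)
w1 = reinforce_update(w0, [(feats, 2)], z=+1)
```

The "theoretical ceiling" worry above corresponds to a known property of pure self-play policy gradients: the policy only improves relative to opponents it actually meets, so play among weak versions of itself bounds what can be learned.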
The third stage: first use the ordinary policy network to generate the first U-1 moves of a game (U is a random variable over [1, 450]), then choose the position of move U by random sampling (this increases the diversity of the games and prevents overfitting).
The strengthened policy network then plays the rest of the game out to the end. Afterwards, the board position at step U is used as the feature input and the final win or loss as the label to train a value network, which estimates the probability of winning from a given position.
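The data-generation recipe in this stage can be sketched as follows. The game driver below is a self-contained stand-in (placeholder states, random winner); in the real pipeline the states would come from the two policy networks playing an actual game:

```python
import random

def self_play_game(n_moves=100):
    """Stand-in for one self-play game: a list of board states plus a winner.
    In AlphaGo the first U-1 moves come from the ordinary policy network,
    move U is sampled randomly, and the rest come from the strengthened one."""
    states = [f"state_{i}" for i in range(1, n_moves + 1)]  # placeholder states
    winner = random.choice([+1, -1])
    return states, winner

def make_value_sample():
    """One (position, outcome) training pair for the value network:
    the board at a random step U, labeled with the game's final result."""
    states, winner = self_play_game()
    u = random.randint(1, len(states))      # random step U for diversity
    return states[u - 1], winner
```

Taking only one random position per game (rather than every position) keeps the training pairs close to independent, which is the over-fitting concern the random step U addresses.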
The value network is in fact a major innovation of AlphaGo. The hardest part of Go is judging the final outcome from the current position, something even professional players struggle to master.
Through massive self-play, AlphaGo generated 30 million game positions to train the value network. However, because the search space is so large, even 30 million positions cannot let AlphaGo completely overcome this problem.
The online play process consists of five key steps, and the core idea is to embed the deep neural networks in Monte Carlo tree search (MCTS) to reduce the search space. AlphaGo has no real capacity for thought.
1. Extract the corresponding features from the current board position;
2. Use the policy network to estimate the probability of playing at each empty point on the board;
3. Compute a weight for developing each candidate point; the initial value is the move probability itself (e.g. 0.18). In reality it may be a function that takes the probability as input; we simplify here for ease of understanding.
4. Judge the position with the value network and with the fast rollout policy separately; the two evaluations are combined into a score for who ultimately wins from this point.
Using the fast rollout policy here is a way of using speed in exchange for quantity: from the position being judged, play quickly to the end of the game; each finished rollout yields a win or loss, and the win rate of this node is then aggregated over many such rollouts.
The value network, by contrast, evaluates the final outcome directly from the current state. The two approaches have complementary strengths and weaknesses.
5. Use the score computed in step 4 to update the weight of the move position chosen earlier (e.g. from 0.18 down to 0.12); then continue searching and updating from the branch with the largest weight, say 0.15.
These weights are updated in parallel. When the number of visits to a node exceeds a certain threshold, the search is expanded one level deeper in the Monte Carlo tree.
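Steps 1-5 can be sketched as a single MCTS simulation loop. The prior probabilities, the mixed evaluation, and the visit-count expansion threshold follow the description above (the Nature paper mixes the two evaluations with weight 0.5); the policy and evaluation functions are random stubs, and sign handling for alternating players is omitted for brevity:

```python
import math
import random

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s,a) from the policy network (step 3)
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}

LAMBDA = 0.5            # mixing weight between value net and rollout (step 4)
EXPAND_THRESHOLD = 3    # expand a node once it has been visited often enough

def evaluate(state):
    """Step 4: mix the value network with a fast rollout. Both are stubbed
    with random numbers here; real versions would inspect the board."""
    v_net = random.uniform(-1, 1)       # value-network estimate (stub)
    v_rollout = random.uniform(-1, 1)   # result of a fast rollout (stub)
    return (1 - LAMBDA) * v_net + LAMBDA * v_rollout

def select_child(node, c_puct=1.0):
    """Step 5: prefer moves with high prior, high mean value, few visits."""
    def score(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(node.visits + 1) / (1 + child.visits)
        return q + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def simulate(node, state, policy):
    """One simulation: select, expand past the visit threshold, evaluate,
    and back the score up the path (updating the weights of step 5)."""
    node.visits += 1
    if node.visits >= EXPAND_THRESHOLD and not node.children:
        for move, p in policy(state).items():   # steps 1-3 at the new leaf
            node.children[move] = Node(p)
    if node.children:
        move, child = select_child(node)
        value = simulate(child, state + (move,), policy)
    else:
        value = evaluate(state)
    node.value_sum += value
    return value
```

After many simulations the move is usually chosen as the root child with the most visits, which is why enlarging the number of plausible candidate moves (the weakness discussed below) directly dilutes the search.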
MCTS expands next-level nodes
So what are AlphaGo's weaknesses?
1. Attack its policy network: enlarge the search space.
After entering the middle game, if a professional player can build a sufficiently complex position in which each move affects the fate of many local groups (avoiding single, isolated local fights), then the space AlphaGo must search grows sharply, and the accuracy of the solution it can reach in a short time drops greatly.
The fourth game of Lee Sedol (9-dan) had something of this flavor: five black and white groups were entangled with one another, and after a single white move Black had to weigh many points at once.
Many of those points require deep search in the MCTS; to produce a result within the time limit, search accuracy has to be sacrificed.
Game record of Lee Sedol's fourth game against AlphaGo
2. Attack its value network: fight kos without end.
AlphaGo's value network greatly improves the accuracy of MCTS position judgment, but there is still a long way to go. A neural network cannot entirely avoid strange (or even wrong) judgments at certain moments, and its training samples are far from sufficient.
That is why the value network still has to be supplemented by fast rollouts when judging a position.
Many people once doubted AlphaGo's ability to fight ko, sensing signs that it tried to avoid ko fights. In fact, Professor Zhou Zhihua of Nanjing University has written an article pointing out that ko fights can make the value network fail; we will not repeat the argument here.
In short, a ko started too late or too early is of little use: even if it makes the value network fail, the fast rollout network can compensate.
It is best to start the ko just after entering the middle game (starting too early is not damaging enough) and to sustain it for a long time, ideally with two or more kos on the board at the same time.
Without its value network, AlphaGo's strength is in fact only around professional 3-dan.
Zheng Yu (Ph.D., professor, doctoral supervisor) is a director at Microsoft Research Asia and head of its Urban Computing group, editor-in-chief of ACM Transactions on Intelligent Systems and Technology, named one of MIT Technology Review's 35 Innovators Under 35 (TR35) in 2013, and secretary-general of the ACM Data Mining China Chapter.
Zhang Junbo (Ph.D.) is an associate researcher in the Urban Computing group at Microsoft Research Asia, working in the field of deep learning.