Sihai network

AMD Ryzen Threadripper 3970X first review: firmly at the top of the HEDT platform

AMD built the Ryzen Threadripper line for HEDT and workstation use, where its direct competitor is Intel's Core X series. Intel's core count has been capped at 18 for years, while AMD has been far more aggressive with Threadripper: the first generation had 16 cores, and the second generation doubled that to 32, which pushed Intel to bring the 28-core Xeon W-3175X to the consumer market. Now the third-generation Threadripper has arrived. The first products on the market are 24-core and 32-core parts only, with higher core counts to follow. Intel's new Cascade Lake-X processors, released a few hours earlier, still top out at 18 cores, but they are priced far below what the previous generation cost at launch, which is entirely down to the pressure Threadripper has put on its rival.

In fact, I was a little surprised when AMD launched the 16-core Ryzen 9 3950X, because it brings an HEDT-class processor to the mainstream platform, and some of its performance really is comparable to Intel's Core i9-9980XE. Isn't AMD afraid of hurting sales of its own Threadripper? However, the X570 and X399 platforms differ clearly in expansion capability: on the former, the CPU and chipset together provide 36 usable PCI-E lanes, while on the latter the CPU alone provides 60. The new TRX40 platform offers even more PCI-E lanes. Moreover, the third-generation Threadripper has no 16-core model; core counts start at 24, and the initial launch consists of just two products, a 32-core and a 24-core.
During the launch event I assumed the maximum core count of this Threadripper generation would match the previous one. At the end of the event, however, AMD suddenly announced that parts with more than 32 cores are coming: the 64-core Ryzen Threadripper 3990X will launch next year, and judging by the naming, a 48-core Ryzen Threadripper 3980X should exist as well.

The Ryzen Threadripper 3970X has 32 cores and 64 threads, with a base frequency of 3.7GHz and a maximum boost frequency of 4.5GHz. It is priced at 15299 yuan;

The Ryzen Threadripper 3960X has 24 cores and 48 threads, with a base frequency of 3.8GHz and a maximum boost frequency of 4.5GHz. It is priced at 10699 yuan.

Ryzen Threadripper 3970X processor photos

The third-generation Threadripper has changed its retail packaging once again. The box is much smaller than those of the previous two generations, and the foam outer shell has given way to more environmentally friendly thick cardboard. The inner packaging is more elaborate: the first two generations shipped the processor in an orange-and-black box, whereas now the processor is locked inside a monument-shaped container with a transparent shell. After removing the transparent acrylic cover, you need to release the latch on the side to take out the processor. A hex screwdriver, a CPU mounting bracket and two Ryzen Threadripper stickers, one large and one small, sit in a small box at the bottom of the package. Although the third-generation Threadripper moves to a new socket, the socket's shape, pin positions and pin count are unchanged; only the pin definitions differ. The Ryzen Threadripper 3970X therefore looks no different from the previous two generations, and it still sits in an orange installation carrier.

What has changed in third-generation Ryzen

Threadripper has used an MCM (multi-chip module) package since its debut, and the third generation is no exception. The core architecture, however, has evolved from Zen+ to Zen 2, and the compute dies have moved from GlobalFoundries' 12nm process to TSMC's 7nm, although the I/O die is still built on 12nm. The processor uses the new sTRX4 socket. Its shape and pin count are exactly the same as those of the earlier TR4 socket, but the pin definitions are different, so it is not backward compatible with X399 motherboards and works only with the new TRX40 boards.

Improvements from the 7nm process

Zen 2 Ryzen processors are manufactured on a 7nm process, and the foundry is no longer AMD's former partner GlobalFoundries but TSMC. Moreover, unlike the earlier 12nm process, which was only a refined version of 14nm, TSMC's 7nm is a genuinely new node. According to AMD, the 7nm process doubles transistor density and either halves power consumption at the same performance or raises performance by 25% at the same power. In practice, a Zen 2 CCX on 7nm measures 31mm², against 44mm² for a Zen+ CCX on 12nm, a reduction of about 29%. Bear in mind that the L3 cache in each Zen 2 CCX has doubled, and cache occupies considerable area. That the CCX shrank this much despite the doubled cache shows how large the step from 12nm to 7nm is. At the same voltage, a 7nm part clocks about 350MHz higher than a 12nm part. The energy efficiency of third-generation Ryzen on Zen 2 is 75% higher than that of second-generation Ryzen on Zen+, and 58% higher than that of Intel's ninth-generation Core processors on 14nm+.
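The area comparison above is easy to check by hand. The following is a minimal Python sketch that uses only the two CCX areas quoted in the text; the function name `area_reduction` and the constant names are ours, chosen for illustration.

```python
def area_reduction(old_mm2: float, new_mm2: float) -> float:
    """Return the fraction of area saved when shrinking from old_mm2 to new_mm2."""
    return (old_mm2 - new_mm2) / old_mm2


# CCX areas quoted in the article, in mm^2.
ZEN_PLUS_CCX_12NM = 44.0
ZEN2_CCX_7NM = 31.0

reduction = area_reduction(ZEN_PLUS_CCX_12NM, ZEN2_CCX_7NM)
print(f"CCX area reduction: {reduction:.1%}")
```

The result is 13/44, or about 29.5%, which matches the roughly 29% figure quoted for the CCX shrink.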

According to AMD, Zen 2 builds on the Zen and Zen+ architectures and can be regarded as their continuation, but it also brings many innovations and improvements that substantially raise compute and expansion capability. Compared with Zen+, Zen 2 improves IPC by 15%, doubles cache capacity and doubles floating-point throughput. A Zen 2 core keeps the simultaneous multithreading (SMT) design with two threads per core, but it has a larger micro-op cache that holds 4K instructions, and the L3 cache is twice the size of that in Zen and Zen+. Each core has four integer units and two floating-point units. Zen 2 uses a new branch predictor that cuts the misprediction rate by 30%, so the processor spends less time on front-end work and computes more efficiently.

The cache system of Zen 2 has also been further optimized. The L1 instruction cache changes from 64KB, 4-way to 32KB, 8-way set associative. The L1 data cache is 32KB, 8-way set associative with a 32-byte access width, double that of Zen. The L2 cache remains 512KB, 8-way set associative, and the prefetch mechanisms of the L1 and L2 caches have been improved. The L3 cache is a shared 16MB, 16-way set associative cache, twice the previous capacity.

Improvements to the instruction fetch stage include a new TAGE branch predictor, which complements the existing neural-network predictor to improve accuracy, and the branch target buffer (BTB) has changed as well.
In the original Zen architecture the BTB has three levels: the L0 BTB has 16 entries, the L1 BTB 256 entries and the L2 BTB 4K entries. In Zen 2 the L0 BTB is unchanged, the L1 BTB doubles to 512 entries, and the L2 BTB grows by 1.75 times to 7K entries, accompanied by a larger 1K-entry indirect target array. Improvements to instruction decode include an optimized op cache, a micro-op cache doubled to 4K entries, better instruction fusion, and higher throughput from avoiding repeated decoding.

On the floating-point side, AMD's current Ryzen and EPYC processors support AVX2. In Zen 2, AMD doubled the floating-point datapath from 2x128 bits to 2x256 bits, greatly improving efficiency when executing 256-bit AVX instructions, and multiply latency dropped from 4 cycles to 3. These floating-point changes substantially improve Zen 2 performance in content-creation applications.

On the integer side, the Zen 2 integer scheduler grows from 84 to 92 entries, made up of four 16-entry ALU queues and one 28-entry AGU queue. Each core has four integer ALUs and three AGUs (address generation units), one more AGU than in Zen. This lets the execution engine fetch data from memory more efficiently, improves fairness when SMT threads share the ALUs and AGUs, and reduces resource contention between threads. The physical register file grows from 168 to 180 entries, so the core can keep more working data immediately accessible. The load/store system has been improved as well.

Compared with Zen+, Zen 2 improves single-thread performance by 21%, of which 60% comes from the IPC gains of the architectural optimizations and 40% from the frequency gains of the 7nm process.
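The structure sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it treats "K" as 1024 entries, and it reads the 60%/40% split as shares of the total 21% gain. Both are our assumptions, and all names in the code are hypothetical.

```python
# Structure sizes quoted in the article, as (Zen, Zen 2) pairs.
# Assumption: "4K" and "7K" mean 4 * 1024 and 7 * 1024 entries.
STRUCTURES = {
    "L0 BTB entries": (16, 16),
    "L1 BTB entries": (256, 512),
    "L2 BTB entries": (4 * 1024, 7 * 1024),
    "integer scheduler entries": (84, 92),
    "physical register file entries": (168, 180),
    "FP datapath width in bits": (128, 256),
}


def growth(old: int, new: int) -> float:
    """Return the Zen 2 size as a multiple of the Zen size."""
    return new / old


for name, (zen, zen2) in STRUCTURES.items():
    print(f"{name}: {zen} -> {zen2} ({growth(zen, zen2):.2f}x)")

# Four 16-entry ALU queues plus one 28-entry AGU queue.
zen2_scheduler_total = 4 * 16 + 28

# Assumption: 60% and 40% are shares of the 21% single-thread gain.
SINGLE_THREAD_GAIN = 0.21
ipc_part = 0.60 * SINGLE_THREAD_GAIN
frequency_part = 0.40 * SINGLE_THREAD_GAIN
print(f"scheduler total: {zen2_scheduler_total}")
print(f"IPC part: {ipc_part:.1%}, frequency part: {frequency_part:.1%}")
```

The queue sizes add up to the 92 scheduler entries stated in the text, and the L2 BTB ratio comes out at exactly 1.75. Under the stated reading of the split, roughly 12.6 points of the gain come from IPC and 8.4 points from frequency.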

Overall, Zen 2 mainly addresses the shortcomings of the original Zen and Zen+ architectures and strengthens them in many respects. By doubling cache capacity, raising branch prediction accuracy and widening internal data and instruction paths, it maximizes the operating efficiency of the core.
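To make the cache organisations quoted earlier more concrete, the number of sets in a set-associative cache is its capacity divided by (ways x line size). The sketch below assumes 64-byte cache lines, which the article does not state; `num_sets` and the other names are ours.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines (not stated in the article)
KIB = 1024
MIB = 1024 * KIB


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Return the number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# (capacity in bytes, associativity) as quoted in the article.
CACHES = {
    "Zen L1I": (64 * KIB, 4),
    "Zen 2 L1I": (32 * KIB, 8),
    "Zen 2 L1D": (32 * KIB, 8),
    "Zen 2 L2": (512 * KIB, 8),
    "Zen 2 L3 (shared)": (16 * MIB, 16),
}

for name, (size, ways) in CACHES.items():
    print(f"{name}: {ways}-way, {num_sets(size, ways)} sets")
```

Under that assumption, moving the L1 instruction cache from 64KB, 4-way to 32KB, 8-way halves the capacity but doubles the associativity, so the set count drops from 256 to 64.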

Improved MCM architecture

The previous two generations of Ryzen Threadripper package four Zen/Zen+ dies directly together, interconnected over a 25Gbps Infinity Fabric bus. This layout works without problems on the EPYC server processors, because every die there has its own memory and PCI-E controllers enabled. On Threadripper, however, only die 0 and die 2 provide memory and PCI-E controllers, while die 1 and die 3 have no direct connection to memory or PCI-E. The Ryzen Threadripper 2970WX and 2990WX can therefore only operate in NUMA mode, and access latency from dies 1 and 3 is significantly higher than from dies 0 and 2, which limits the performance of those two dies. Windows' scheduling is not particularly smart either: if a program does not use that many threads but happens to be assigned to those two dies, efficiency drops sharply. That is how the four dies of second-generation Threadripper are connected.

With Zen 2, everything from consumer Ryzen to the EPYC server processors uses MCM packaging, but the way the dies are connected is very different from earlier EPYC and Threadripper parts. In the Zen 2 architecture the CPU is split into CCDs (compute dies) and an IOD (I/O die). Each CCD contains only two CCXs and is responsible purely for computation; the memory, PCI-E, USB and SATA controllers have all moved to the IOD. The CCDs and the IOD are linked by second-generation Infinity Fabric. This adds latency, but it removes the differences in memory and PCI-E access latency between cores. In addition, the large caches inside the CPU and the prediction mechanisms of Zen 2 offset much of that latency. The third-generation Ryzen Threadripper currently connects one IOD to four CCDs.
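The difference between the two layouts can be illustrated with a toy hop-count model. This is only a sketch: the hop counts are abstract units made up for illustration, not measured latencies, it only considers the path from a die to its nearest memory controller, and names such as `hops_gen2` are hypothetical.

```python
# Second-generation WX parts: only dies 0 and 2 have memory and PCI-E attached.
GEN2_DIES_WITH_MEMORY = {0, 2}


def hops_gen2(die: int) -> int:
    """Fabric hops to the nearest memory controller: 0 if local, else 1."""
    return 0 if die in GEN2_DIES_WITH_MEMORY else 1


def hops_gen3(ccd: int) -> int:
    """Every CCD reaches memory through the IOD, so each pays one hop."""
    return 1


gen2 = [hops_gen2(die) for die in range(4)]
gen3 = [hops_gen3(ccd) for ccd in range(4)]

print("gen2 hops per die:", gen2)
print("gen3 hops per CCD:", gen3)
print("gen2 uniform:", len(set(gen2)) == 1)
print("gen3 uniform:", len(set(gen3)) == 1)
```

In this model the older layout gives dies 1 and 3 a longer path than dies 0 and 2, while the IOD design makes every CCD pay the same single hop: a higher floor, but a uniform one.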
If the 48-core and 64-core products arrive next year, they will connect one IOD to eight CCDs. The IOD provides 64 PCI-E 4.0 lanes, a four-channel memory controller and a USB controller with four USB 3.2 Gen 2 ports. With FCLK at 1600MHz, the link between a single CCD and the IOD delivers 51.2GB/s for reads and 25.6GB/s for writes. In addition, the two fastest cores reported through CPPC2 on third-generation Threadripper are fixed on CCD4, which can be understood as CCD4 having the best silicon quality.

All Ryzen, Ryzen Threadripper and EPYC processors use the same CCD: an 8-core chip built on TSMC's 7nm process, with a die area of 74mm² and 3.9 billion transistors. The IOD, however, differs: the one used by Ryzen processors on the AM4 platform is not the same as the one used by EPYC and Ryzen Threadripper. The Ryzen IOD supports only dual-channel memory and at most two CCDs. It is manufactured on GlobalFoundries' 12nm process, with a die area of 125mm² and 2.09 billion transistors. The IOD on EPYC and Ryzen Threadripper supports up to eight CCDs and more PCI-E lanes, so the chip is much larger: the die area reaches 416mm², and the crystal