
On the big AI model, how can Chinese chips ride the wind and waves?

Jan 10
In 2023, breakthroughs in large models and the rise of generative AI pushed the AI industry into a new stage of intelligent innovation, and they are also triggering new changes in computing-power architecture.

According to the latest "China Artificial Intelligence Computing Power Development Assessment Report 2023-2024", the global artificial intelligence hardware (server) market will grow from US$19.5 billion in 2022 to US$34.7 billion in 2026, a five-year compound annual growth rate of 17.3%. China's artificial intelligence server market is expected to reach US$9.1 billion in 2023, up 82.5% year-on-year, and US$13.4 billion in 2027, a five-year compound annual growth rate of 21.8%. China's computing power market, especially the field of intelligent computing, is booming.
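For reference, the growth rates quoted above follow the standard compound-annual-growth-rate definition over an n-year window (the report's choice of base year fixes n):

\[
\mathrm{CAGR} = \left(\frac{V_{\text{end}}}{V_{\text{start}}}\right)^{1/n} - 1
\]

where V_start and V_end are the market sizes at the two endpoints.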

1. CPU+GPU becomes the main form of AI heterogeneous computing
In the era of large models, building and tuning generative AI foundation models to meet application needs will bring change and development opportunities to the entire infrastructure market. "Application-oriented and system-centered" will be the main path for future computing-power upgrades.

From a technology standpoint, heterogeneous computing remains one of the major trends in chip development. Within a single system, heterogeneous computing combines different types of processors (CPU, GPU, ASIC, FPGA, NPU, etc.) so that each handles the tasks it suits best, optimizing performance and efficiency and putting different computing resources to work on different computing needs. For example, the GPU's parallel processing capability can improve the training speed and efficiency of models, especially large ones; the CPU can handle computation and decision-making in stages such as data preprocessing and model tuning, and can control and coordinate other computing resources (such as GPUs and FPGAs) to keep the overall computation running smoothly; and FPGAs can accelerate inference on edge devices for faster real-time results.
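As a minimal illustration of this division of labor, the sketch below (assuming PyTorch; the model, data, and training loop are placeholders, not drawn from the article) keeps data preparation on the CPU and hands the heavy parallel math to the GPU when one is available:

    import torch
    import torch.nn as nn

    # Pick the accelerator if one is present; fall back to the CPU otherwise.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Placeholder model standing in for a large network.
    model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def preprocess(raw_batch: torch.Tensor) -> torch.Tensor:
        # Data preparation (here, normalization) stays on the CPU,
        # leaving the GPU free for the parallel math it does best.
        return (raw_batch - raw_batch.mean()) / (raw_batch.std() + 1e-6)

    for _ in range(3):                      # a few toy training steps
        raw = torch.randn(32, 512)          # CPU-side input pipeline
        batch = preprocess(raw).to(device)  # hand off to the accelerator
        loss = model(batch).square().mean() # forward pass on the GPU
        optimizer.zero_grad()
        loss.backward()                     # backward pass on the GPU
        optimizer.step()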

IDC survey research shows that as of October 2023, the Chinese market broadly regarded the "CPU + GPU" combination as the main form of AI heterogeneous computing.

2. Three major challenges for AI chips in the era of large models
Rising demand for AI computing power has opened up greater room for China's local chip manufacturers and brought new opportunities. IDC predicts that China's artificial intelligence chip shipments will reach 1.335 million units in 2023, a year-on-year increase of 22.5%.

Alongside these vast opportunities, China's AI chips also face new development challenges in the era of large models.

First, there is a large gap with the world's leading AI chips. Taking NVIDIA's latest H200 GPU as an example, its performance is nearly five times that of the A100. Only a few of China's AI chips approach the A100/A800 in large-model cluster training performance, and most deliver less than 50% of it. This implies roughly a three-year generational gap between China's AI chips and the international state of the art in large-model training performance.

Second, in terms of ecosystem: after 17 years and more than US$10 billion of cumulative investment, NVIDIA's CUDA has over 3 million developers worldwide and has become a foundational library with a near-monopoly on global AI development. By contrast, domestic AI chip companies together hold less than 10% of the market, each vendor's AI chip software differs, and the ecosystem is fragmented.

In addition, in the current environment, China's AI chip production capacity is blocked and key technologies for moving up to high-end chips are restricted, which also limits the development of AI chips to a certain extent.

3. Solving the three problems of heterogeneous computing power
Given this situation, Lin Yonghua, vice president and chief engineer of the Beijing Academy of Artificial Intelligence (Zhiyuan), argues that in the era of large models, China's heterogeneous computing power faces three main constraints.

Heterogeneous computing power constraint 1: different kinds of computing power cannot be pooled for training
Specifically, current heterogeneous hybrid distributed training faces the following challenges: the software and hardware stacks of devices with different architectures are incompatible, and their numerical precision may differ; efficient communication between devices with different architectures is difficult; and devices differ in computing power and memory, making load-balanced sharding hard.

These challenges are hard to solve all at once. For now, Zhiyuan has experimented with heterogeneous training across different generations of devices of the same architecture, or across different devices with compatible architectures; heterogeneous training across devices with different architectures is left for future exploration. FlagScale is a framework that supports pooled training across heterogeneous chips from multiple vendors; it currently implements two modes: heterogeneous pipeline parallelism and heterogeneous data parallelism.

Heterogeneous pipeline parallelism: in actual training, this mode can be combined with data parallelism, tensor parallelism, and sequence parallelism for efficient training. Given the memory-usage profile of backpropagation, it suits placing devices with relatively large memory in the front pipeline stages and devices with small memory in the rear stages, then assigning network layers to devices according to their computing power to achieve load balancing, as in the sketch below.
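A toy sketch of that load-balancing rule (illustrative only, not FlagScale's actual code; the capacities below are invented): layers are apportioned to pipeline stages in proportion to each stage's device capacity, with the stage list ordered front to back so larger-memory devices come first.

    def assign_layers(num_layers: int, stage_capacities: list[float]) -> list[int]:
        """Split num_layers across pipeline stages in proportion to capacity.

        stage_capacities is ordered front-to-back, so listing the
        larger-memory devices first matches the backpropagation memory
        profile described above.
        """
        total = sum(stage_capacities)
        counts = [int(num_layers * c / total) for c in stage_capacities]
        for i in range(num_layers - sum(counts)):   # hand rounding leftovers
            counts[i % len(counts)] += 1            # to the front stages first
        return counts

    # Hypothetical example: a 48-layer model over a large-, mid-, and
    # small-memory device with relative capacities 4:3:2.
    print(assign_layers(48, [4.0, 3.0, 2.0]))  # -> [22, 16, 10]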

Heterogeneous data parallelism: in actual training, this mode can be combined with tensor parallelism, pipeline parallelism, and sequence parallelism for large-scale, efficient training. Devices with more computing power and memory handle larger micro-batches, while devices with less handle smaller ones, balancing the load across different devices, as sketched below.
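The analogous sketch for the data-parallel mode (again illustrative, with invented relative speeds): the global batch is split into per-device micro-batches proportional to computing power.

    def split_batch(global_batch: int, device_speeds: list[float]) -> list[int]:
        """Give each data-parallel device a micro-batch proportional to its speed."""
        total = sum(device_speeds)
        sizes = [int(global_batch * s / total) for s in device_speeds]
        for i in range(global_batch - sum(sizes)):  # distribute rounding leftovers
            sizes[i % len(sizes)] += 1
        return sizes

    # Hypothetical mix: two fast accelerators and two slower ones (3:3:1:1).
    print(split_batch(256, [3.0, 3.0, 1.0, 1.0]))  # -> [96, 96, 32, 32]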

Zhiyuan demonstrated three sets of heterogeneous hybrid training experiments on mixed NVIDIA and Tianshu Zhixin clusters; in all three configurations, heterogeneous hybrid training approached or even exceeded the expected performance upper bound. This shows that its efficiency loss is low and that it delivers real training benefits.

Lin Yonghua explained that the heterogeneous pooled-training framework FlagScale already supports pooled training across NVIDIA and Tianshu Zhixin computing clusters. Going forward, it will support pooled training across clusters from more Chinese manufacturers, promote standardization of the communication libraries of heterogeneous chips from different vendors, and achieve high-speed interoperability and interconnection.

She noted that as chips iterate, new and old generations will inevitably be mixed, so work should continue on hybrid training techniques compatible with heterogeneous chips. She also hopes that varied business resources can be flexibly combined within the same data center to maximize performance and efficiency.

Heterogeneous computing power constraint 2: locked into CUDA, operator libraries are difficult to adapt to different hardware
Currently, China's AI chip software ecosystem is weak: mainstream AI frameworks mainly support NVIDIA chips, so domestic AI chips must be adapted to multiple frameworks, and every framework version upgrade forces the adaptation to be redone. Meanwhile, each AI chip manufacturer maintains its own underlying software stack, and these stacks are mutually incompatible.

Under large-model demands, these problems have three major consequences: first, the operators and optimization methods required by large models are missing, so models cannot run or run inefficiently; second, differences in chip architecture and supporting software implementations introduce accuracy errors; third, running large-model training on domestic AI chips requires extensive porting, making adaptation and migration very costly.

In this regard, Lin Yonghua believes that building a public, open AI chip software ecosystem is critical. Driven by large-model R&D needs, the infrastructure layer should build a next-generation open and neutral AI compiler middle layer, adapt to the PyTorch framework, and support open-source programming languages and compiler extensions. The next step is to keep exploring common core technologies that maximize the performance and utilization of the hardware infrastructure, push software-hardware co-optimization of typical and complex operators to the limit, and open-source the results so they can support large-model training efficiently.
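One way to picture such a neutral operator middle layer (a deliberately simplified sketch based on the article's description, not any real project's API): a shared registry that frameworks dispatch through, with each chip vendor plugging in its own kernels.

    from typing import Callable, Dict, Tuple

    # registry[(op_name, backend)] -> kernel implementation
    _KERNELS: Dict[Tuple[str, str], Callable] = {}

    def register_kernel(op: str, backend: str):
        """Decorator a chip vendor uses to plug a kernel into the shared layer."""
        def wrap(fn: Callable) -> Callable:
            _KERNELS[(op, backend)] = fn
            return fn
        return wrap

    def dispatch(op: str, backend: str, *args):
        """Frameworks call one stable entry point, whatever the hardware."""
        if (op, backend) not in _KERNELS:
            raise NotImplementedError(f"{op!r} has no kernel for backend {backend!r}")
        return _KERNELS[(op, backend)](*args)

    # Hypothetical vendor registrations: a reference CPU kernel and a stand-in
    # for an accelerator kernel (a real one would call the vendor's library).
    @register_kernel("matmul", "cpu")
    def _matmul_cpu(a, b):
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

    @register_kernel("matmul", "vendor_x")
    def _matmul_vendor_x(a, b):
        return _matmul_cpu(a, b)  # placeholder: same math, different backend label

    print(dispatch("matmul", "cpu", [[1, 2]], [[3], [4]]))       # [[11]]
    print(dispatch("matmul", "vendor_x", [[1, 2]], [[3], [4]]))  # [[11]]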

Heterogeneous computing power constraint 3: differing chip architectures and software make evaluation difficult and slow adoption
Currently, there are many AI chip companies with distinct architectures and development tool chains, many AI frameworks, and endlessly varied scenarios and fast-changing models. The result is a heavy adaptation workload, high development complexity, and difficulty agreeing on evaluation standards, all of which slow product adoption and large-scale application.

Lin Yonghua believes the evaluation of heterogeneous AI chips is of great value to the industry ecosystem, and that the industry currently lacks a widely recognized, neutral, open-source, and open evaluation system for them. An open-source AI chip evaluation project should be established, covering the basic environment, the chips' basic software, and test sets, and comprehensively evaluating whether models run, chip training time and computing throughput, utilization of the chip and other server components, and the chip's support for different frameworks and software ecosystems.
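A minimal sketch of the kind of measurement such a harness would standardize (all names here are hypothetical; a real suite would also cover accuracy, multi-chip scaling, and framework support): run a fixed synthetic training workload and report step time and throughput.

    import time
    import torch
    import torch.nn as nn

    def benchmark_training(model: nn.Module, batch_size: int, steps: int,
                           device: str = "cpu") -> dict:
        """Time a fixed synthetic workload; report step time and throughput."""
        model = model.to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        x = torch.randn(batch_size, 512, device=device)

        start = time.perf_counter()
        for _ in range(steps):
            loss = model(x).square().mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # (On GPUs, call torch.cuda.synchronize() before reading the clock.)
        elapsed = time.perf_counter() - start

        return {
            "seconds_per_step": elapsed / steps,
            "samples_per_second": batch_size * steps / elapsed,
        }

    # Same workload on any backend: swap device for "cuda" or a vendor
    # runtime to compare chips under identical conditions.
    print(benchmark_training(nn.Linear(512, 512), batch_size=64, steps=20))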

4. Closing thoughts
The development of large AI models has increased demand for intelligent computing power. IDC data show that from 2022 to 2027, China's intelligent computing power will grow at a compound annual rate of 33.9%, outpacing the 16.6% compound annual growth of general-purpose computing power over the same period.


Local AI chip manufacturers face new opportunities and challenges. Given the bottleneck in single-chip computing power and the difficulty of training across pooled heterogeneous chips, building a computing-power infrastructure platform with system-level thinking has become the key to the future. In particular, as large models move from basic R&D to application, the importance and value of the software infrastructure that matches the hardware, including operating systems, middleware, and tool chains, will become even more pronounced.

Now that large models have completed their "0 to 1" pre-training phase, AI chips, as the core foundational link, must pass the test of the "1 to 100" phase of application and large-scale deployment, a process that will profoundly shape China's AI chip industry.


For more electronic component requirements, please see: https://www.megasourceel.com/products.html

MegaSource Co., LTD.