"Training data is technology."
That’s what OpenAI co-founder Ilya Sutskever said in an interview with The Verge. ChatGPT has amazed the world since its release, and the stunning performance of GPT-4 suggests we have entered a new era in AI.
What makes large models seem so omniscient? In our opinion, the answer may lie in the data.
This article is a collection of Dr. Qingqing Zhang’s thoughts on data, large models, and generative AI.
Conversation Is the Key to Human-Computer Interaction
OpenAI was founded in 2015, while Magic Data was established in 2016. For seven years, Magic Data has focused on conversational data research. Over the years, it has often been asked why it chose to study data rather than enter better-known AI fields such as intelligent customer service and autonomous driving.
Just like OpenAI, which worked quietly until it seemingly became one of the world’s most famous companies overnight, Magic Data believes in the power of compounding to achieve breakthrough development.
Today, ChatGPT is helping more people understand the importance of conversational AI. Dr. Qingqing Zhang’s understanding of conversational AI comes from 18 years in the industry. During her time at the Chinese Academy of Sciences, she helped several large enterprises build conversational systems. Through this process, she found that selecting and processing data, and coupling data and models in a closed loop, are critical to achieving the best results in artificial intelligence. Data directly affects computing power and algorithms, beyond its own intrinsic value.
Magic Data firmly believes that conversational AI is key to future human-machine interaction. This is why we have been focusing on this field until today.
Data Pyramid of AIGC Large Model
According to publicly available information, ChatGPT was built through pre-training and fine-tuning, incorporating reinforcement learning from human feedback (RLHF). The entire training process is a continuous coupling of human and machine: the model is optimized through human feedback, with conversational question answering at its core.
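The three stages above can be sketched as a simple pipeline. This is a toy illustration of the structure described in public accounts, not OpenAI's actual implementation; every function name and data structure here is a hypothetical stand-in.

```python
# A minimal, illustrative sketch of the three training stages described above.
# All function names and model representations are hypothetical stand-ins.

def pretrain(corpus):
    """Stage 1: learn general language patterns from massive unlabeled text."""
    return {"stage": "pretrained",
            "tokens_seen": sum(len(doc.split()) for doc in corpus)}

def supervised_finetune(model, qa_pairs):
    """Stage 2: fine-tune on human-written question-answer demonstrations."""
    return dict(model, stage="finetuned", demos=len(qa_pairs))

def rlhf(model, ranked_outputs):
    """Stage 3: optimize against human preference rankings (RLHF)."""
    mean_rank = sum(rank for _, rank in ranked_outputs) / len(ranked_outputs)
    return dict(model, stage="rlhf", mean_rank=mean_rank)

corpus = ["the cat sat on the mat", "large models need data"]
qa_pairs = [("What is AI?", "Artificial intelligence is ...")]
ranked = [("answer A", 1), ("answer B", 2)]

model = rlhf(supervised_finetune(pretrain(corpus), qa_pairs), ranked)
print(model["stage"])  # rlhf
```

Each stage consumes a different kind of data, which is exactly the point of the pyramid discussed next.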
Building a large model like ChatGPT requires three types of data.
The first type is massive unstructured data for pre-training. It requires no human intervention, but its accuracy and quality are low: it inevitably contains a great deal of low-quality content, and because the model has a huge number of parameters, training on it consumes enormous computing power and carries certain risks.
The second type is data produced through human-machine collaboration, including manually written Q&A pairs, human quality rankings of machine-generated outputs, and machine-generated ranking data.
Image credit: OpenAI
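To make the second data type concrete, here is what a single human-ranking record might look like. The field names and contents are invented for illustration; real annotation schemas vary by vendor and project.

```python
# Hypothetical example of human-machine collaboration data: a prompt, several
# machine-generated candidate answers, and a human preference ranking over them.

comparison_record = {
    "prompt": "How do I reset my router?",
    "candidates": [
        {"id": "a", "text": "Unplug it for 30 seconds, then plug it back in."},
        {"id": "b", "text": "Routers cannot be reset."},
    ],
    # Annotators rank candidate ids from best (first) to worst (last).
    "human_ranking": ["a", "b"],
}

def best_candidate(record):
    """Return the text of the top-ranked candidate."""
    top_id = record["human_ranking"][0]
    return next(c["text"] for c in record["candidates"] if c["id"] == top_id)

print(best_candidate(comparison_record))
```

Records of this shape are what a reward model can be trained on: the ranking, not the raw text alone, carries the human feedback signal.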
The third type is the knowledge-base dataset, which may not need to be large but must be very precise and accurate. Expert knowledge data in vertical domains will be the key to improving the quality of ChatGPT.
Dr. Qingqing Zhang believes that good data for building a conversational model must satisfy three criteria. First, it should be as natural as possible, close to the way humans actually converse, rather than cold and mechanical. Second, it must be domain-relevant, with correct vertical-domain knowledge, which requires the involvement of expert systems. Most importantly, data security and compliance are crucial to building a safe and reliable ChatGPT.
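The three criteria can be applied as a simple quality gate over candidate utterances. The scoring thresholds and field layout below are invented for demonstration; a production pipeline would use trained classifiers and legal review rather than hand-set flags.

```python
# An illustrative filter for the three criteria above: naturalness,
# domain relevance, and compliance. All thresholds and flags are invented.

def passes_quality_gate(utterance, naturalness, in_domain, compliant,
                        min_naturalness=0.7):
    """Keep an utterance only if it meets all three criteria."""
    return naturalness >= min_naturalness and in_domain and compliant

samples = [
    ("Sure, let me check that for you.", 0.9, True, True),
    ("ERROR CODE 42 RESPONSE NULL", 0.1, True, True),    # fails: unnatural
    ("Great recipe! Add more salt.", 0.8, False, True),  # fails: off-domain
]
kept = [s[0] for s in samples if passes_quality_gate(*s)]
print(kept)  # ['Sure, let me check that for you.']
```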
Data Resources Depleted? - Generative AI Data Trends
How can we meet the massive training data demands of large models? According to market research, the real data available on the internet will be nearly exhausted by 2026. In future AI training, both real and generative data will be considered for use. Gartner predicts that the usage of generative data will surpass that of real data.
Image Credit: Gartner
Generative data is produced by building a mathematical model or simulation environment rather than being collected directly from the real world. Such datasets can be adjusted and controlled as needed and can cover more application scenarios, helping people and machines better understand the characteristics and behavior of data. The advantage of generative data is that generation conditions can be controlled precisely and data-compliance requirements can be met.
By using generative AI data, both the quantity and diversity requirements of AI training data can be met while ensuring compliance to the highest degree possible. In the future, large models such as ChatGPT will have the opportunity to use such generative AI data for training.
The Advantages and Capabilities of Magic Data - Accumulated Over 100 Million Rounds of Multi-turn Conversation Data
As an AI data solutions company, Magic Data has focused on building multi-turn conversation data for the past seven years, accumulating over 100 million rounds (200,000 hours) of high-precision data. All of it has been manually proofread and annotated to ensure high quality. The data was obtained through crowdsourcing: consumer (C-end) users were invited to contribute data in exchange for rewards.
The data is split by industry and covers topics in daily life, as well as some domain-specific knowledge.
Magic Data’s domain-specific multi-turn conversation data; visit www.magicdatatech.com to see more
Specialization Is Necessary in Every Profession, and AI Development Requires Collaboration among AI Applications, Compute Capability, and Data Science
Those who develop AI models are not necessarily data experts, but data science is crucial to the development of AI. Data science and algorithm frameworks are separate disciplines, yet closely intertwined: data science relies on an understanding of how the frameworks operate, while optimizing an algorithm framework requires the support of data science. A good AI result can only be achieved by considering both together.
The key to implementing AI lies in its application, production, and engineering capabilities, which must be closely linked to the industry. AI algorithm practitioners may move into AI engineering in the future, which would be a highly fulfilling pursuit. At the same time, the challenge of putting AI into practice cannot be ignored. To address it, we need to work with all ecosystem partners to build a machine learning operations (MLOps) loop, closing the cycle of data processing and model iteration. This loop is long and extensive, requiring specialized knowledge at every step. We hope to work with all ecosystem partners to bring AI into widespread use across industries.
Better Data, Fewer GPUs
Today, a large model like ChatGPT consumes vast computing power in both training and deployment: a single training run can consume energy worth millions, and serving the model is equally expensive. Given that energy resources are limited, we need to consider how to develop AI in a more environmentally friendly way. How can we balance AI development and resource consumption? Data may be the answer. We need to make AI models leaner, not bloated, and to that end the data feeding the model should be of high quality, not miscellaneous or bulky. Only then can computing power be saved and the sustainability of AI development be realized.