How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition
The Relationship Between LLM Supervised Fine-tuning and Composite Data Composition
The Problem and Motivation
Q1. [Motivation] What task is the paper trying to solve? How does this task differ from single-task learning, and what are its challenges? [Section: 1– Introduction]
- The task the paper tries to solve:
In this study, we focus on the data composition problem among mathematical reasoning, code generation, and general human-aligning abilities in SFT.
We aim to comprehensively investigate the relationship between model performance and different factors, including data amount, data composition ratio, model scale, and SFT training strategies.
- Difference from single-task learning:
Existing research has mostly conducted separate SFT investigations for each of the three abilities, whereas this paper investigates the overall performance of SFT with composite, multi-ability data.
- Challenges:
Mixing data from different abilities can lead to performance conflicts among them, particularly in the high-resource setting where data is plentiful.
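To make "data amount" and "data composition ratio" concrete, below is a minimal sketch of how a mixed SFT set might be assembled for one experimental setting. The function name, signature, and sampling scheme are illustrative assumptions, not the paper's actual implementation.

```python
import random

def mix_sft_data(math_data, code_data, general_data, amount, ratio):
    """Assemble one mixed SFT set (illustrative sketch, not the paper's code).

    `amount` is the number of general-ability examples to keep, and
    `ratio` scales how much of each specialized set is blended in.
    """
    k = int(amount * ratio)
    mixed = (
        random.sample(math_data, min(k, len(math_data)))
        + random.sample(code_data, min(k, len(code_data)))
        + random.sample(general_data, min(amount, len(general_data)))
    )
    random.shuffle(mixed)
    return mixed
```

Varying `amount` with `ratio` fixed probes the data-amount axis of RQ1, while varying `ratio` with `amount` fixed probes the composition-ratio axis of RQ3.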
Related Work
Q2. [Related Work] What do the authors mean when they mention “aligning LLMs to human intent”? What kinds of datasets do people perform supervised fine-tuning on to align LLMs with human intent, and what do the data actually look like? Look up some of the referenced datasets in this section and describe 1-2 concrete examples. [Section: 2– Related Work]
Note: We are not expecting you to read all the cited work. A few sentences will be sufficient.
Proposed Method and Key Insights
Q3. [Key Insights / Contributions] The authors organize the paper around four major research questions (RQs). For each RQ listed below, briefly summarize the authors’ findings based on the Experiments. [Section: 3- Experiments]
RQ1: How does the performance of individual abilities change as the amount of SFT data increases?
RQ2: How does performance change when different abilities are trained together using mixed SFT data? Why is the strategy that works in the low-resource setting not effective in the setting where we have enough data?
RQ3: What primarily drives the conflicts described in RQ2: the total data size or the data ratio? Under what conditions does the ratio matter?
RQ4: How do different supervised fine-tuning strategies (multi-task, sequential, mixed sequential, DMT) influence ability trade-offs? Which works best?
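To keep the RQ4 strategies straight, here is a rough sketch of how they differ purely as orderings of fine-tuning stages. The `finetune` helper and the data arguments are hypothetical placeholders, and the stage layout reflects one reading of the paper's descriptions rather than its actual code (DMT is sketched separately under Q5).

```python
def multi_task(model, math, code, general, finetune):
    # Multi-task: a single SFT pass over all abilities mixed together.
    return finetune(model, math + code + general)

def sequential(model, math, code, general, finetune):
    # Sequential: fine-tune on one ability after another; abilities
    # learned early risk being forgotten by the final stage.
    for data in (math, code, general):
        model = finetune(model, data)
    return model

def mixed_sequential(model, math, code, general, finetune):
    # Mixed sequential: a multi-task pass over the specialized
    # abilities first, then a pass over the general-alignment data.
    model = finetune(model, math + code)
    return finetune(model, general)
```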
Methods and Experiments
Q4. [Experimental Setup] How do the authors design their experiments to answer the four research questions? In your answer, describe: (a) the model and model sizes that were trained, (b) the specific datasets used to represent each ability, and (c) the SFT strategies and data manipulations the authors vary. [Section: 3– Experiments]
Q5. [Proposed Approach] Describe the Dual-stage Mixed Fine-tuning (DMT) strategy proposed in the paper. How does it help mitigate the issues identified in previous RQs? [Section: 3.5– RQ4]
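As a reference point for answering Q5, a minimal sketch of the dual-stage idea follows: stage one fine-tunes on the specialized data, and stage two fine-tunes on the general-alignment data mixed with a small replayed proportion of the specialized data. The helper names and the default proportion `k` are assumptions for illustration; the paper treats this proportion as a tunable quantity.

```python
import random

def dmt(model, math, code, general, finetune, k=0.25):
    # Stage 1: SFT on the specialized abilities only (math + code).
    model = finetune(model, math + code)
    # Stage 2: SFT on general-alignment data plus a small proportion k
    # of replayed specialized data, intended to curb forgetting of the
    # stage-1 abilities while keeping the specialized share small
    # enough to limit the ability conflicts seen with full mixing.
    replay = (
        random.sample(math, int(len(math) * k))
        + random.sample(code, int(len(code) * k))
    )
    return finetune(model, general + replay)
```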
Limitations and Conclusion
Q6. [Open-Ended] In Section 2, the authors note that “recent initiatives have generated SFT datasets from user logs within proprietary LLM platforms,” meaning that many modern SFT datasets now contain LLM-generated responses. Answer the following:
- (a) What are the potential advantages and disadvantages of using LLM-generated data (e.g., assistant responses from user logs) for supervised fine-tuning?
- (b) Suppose you have two specialized datasets (e.g., math and coding) and you do not have access to other data. How could LLM-generated data be used to help reduce catastrophic forgetting?
