InstOptima:
Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators
Abstract
Instruction-based language modeling has received significant attention in pretrained language models. However, the efficiency of instruction engineering remains low, which hinders the development of instruction studies. Recent studies have focused on automating instruction generation, but they primarily aim to improve performance without considering other crucial objectives that impact instruction quality, such as instruction length and perplexity. Therefore, we propose a novel approach (i.e., InstOptima) that treats instruction generation as an evolutionary multi-objective optimization problem. In contrast to text-editing-based methods, our approach utilizes a large language model (LLM) to simulate instruction operators, including mutation and crossover.
Furthermore, we introduce an objective-guided mechanism for these operators, allowing the LLM to comprehend the objectives and enhance the quality of the generated instructions. Experimental results demonstrate improved fine-tuning performance and the generation of a diverse set of high-quality instructions.
1 Introduction
With the rapid development of language models Ouyang et al. (2022); Touvron et al. (2023); OpenAI (2023), instructions (also known as prompts) play a crucial role in instruction-based language modeling, and different instructions may lead to significant differences in model outputs Zhou et al. (2022); Honovich et al. (2022); Wan et al. (2023). For instance, even slightly perturbed instructions (e.g., synonym substitutions Wang et al. (2021); Zhou et al. (2021) or adversarial attacks Wan et al. (2023); Zhu et al. (2023)) can result in unexpectedly low performance. However, there are three problems regarding instruction-based learning that still need to be addressed in existing works.
Firstly, existing works Lester et al. (2021); Gu et al. (2022); Zhou et al. (2022, 2023); Li et al. (2023); Chen et al. (2023) aim to obtain a large number of instructions through automated instruction generation to filter high-performance instructions. However, due to the large and non-differentiable textual search space Ishibashi et al. (2023); Cho et al. (2023), the automated instruction generation and instruction engineering methods Brown et al. (2020); Liu et al. (2023) are inefficient and struggle to search for various high-quality instructions.
Secondly, the objectives of instruction generation are not clear. Current research Lester et al. (2021); Gu et al. (2022); Pitis et al. (2023) regards performance (i.e., metrics) as the sole criterion for instruction quality. However, model performance alone cannot precisely explain instruction quality. We propose to refine instruction quality by considering fine-grained objectives, such as length and perplexity. Shorter instructions can lower computational costs, especially for large-scale models and datasets. Lower perplexity indicates that instructions are more easily understood by language models.
Lastly, the diversity of instructions has been neglected in existing studies, while increasing the diversity of instructions can mitigate adversarial attacks Wan et al. (2023); Zhu et al. (2023) and improve instruction robustness Yu et al. (2022); Zhu et al. (2023). We aim to obtain multiple alternative instructions based on multi-objective optimization, which can facilitate comprehensive evaluation of instructions.
To address these three problems, we formulate the task as an evolutionary multi-objective optimization problem and propose our framework called InstOptima. We leverage a large language model, specifically ChatGPT OpenAI (2023), to facilitate instruction operations such as mutation and crossover. Furthermore, we introduce an objective-guided mechanism to assist the language model in generating high-quality instructions. In terms of optimization objectives for instruction generation, InstOptima incorporates three objectives: performance (metrics), length, and perplexity, enabling the exploration of a diverse and high-quality set of instructions. We adopt NSGA-II Deb et al. (2002) in InstOptima to obtain a Pareto front of instruction sets.
To validate the efficacy of InstOptima, we conducted experiments on three generation-based classification tasks. The experimental results indicate that InstOptima concurrently obtains a diverse set of instructions that outperform their counterparts in terms of performance.
In summary, our contributions are as follows:
- We simulate instruction operators based on an LLM. We also show that the objective-guided operators help the LLM understand optimization objective values and improve instruction quality.
- We divide the orientation of instruction search into multiple objectives, such as performance, length, and perplexity, facilitating fine-grained control over instruction quality.
- We utilize a multi-objective optimization algorithm to automatically search for a set of high-quality instructions, which could benefit defending against adversarial attacks and improving instruction robustness.
The codes are available at: https://github.com/yangheng95/InstOptima.
Figure 1: The main framework of InstOptima (left) and examples of instruction operations (right). The details of the workflow are explained in Section 2.2. The population consists of individuals that are instruction examples.
2 Proposed Method
In this section, we first introduce instruction-based text generation and then describe the details of InstOptima.
2.1 Instruction-based Generation
In text generation-based tasks (we validate InstOptima on generation-based text classification, and InstOptima can be easily applied to other instruction-based modeling tasks), instructions are utilized to facilitate in-context learning Brown et al. (2020) and improve language modeling. An instruction (depicted in the right part of Fig. 1) is represented as $I = (d, e)$, where $d$ and $e$ are the definition and example of the target task, respectively. $d$ and $e$ are token sequences similar to $x$, where $x$, $y$, and $\mathcal{D}$ denote the input, output, and task dataset, respectively. The modeling of a generation model $f$ is defined as follows:
$\hat{y} = f(I, x)$   (1)

where $\hat{y}$ represents the generated output given $I$ and $x$. In InstOptima, we aim to address the problem of automated instruction generation through multi-objective optimization.
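To make Eq. (1) concrete, the following is a minimal sketch of instruction-based generation with a FlanT5 checkpoint via Hugging Face transformers (the backbone family used in our experiments). The definition, example, and input strings are illustrative placeholders, not the paper's actual optimized instructions.

```python
# Minimal sketch of Eq. (1): y_hat = f(I, x), where the instruction I = (d, e)
# is prepended to the input x. Strings below are illustrative placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

definition = "Classify the sentiment of the sentence as positive or negative."
example = "Example: 'A wonderful film.' -> positive"
x = "The plot was dull and the acting was worse."

# Concatenate the instruction (definition + example) with the input x.
prompt = f"{definition}\n{example}\nInput: {x}\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g., "negative"
```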
2.2 Evolutionary Instruction Optimization
The workflow of InstOptima is illustrated in Fig. 1. We begin by initializing a parent population of instructions to start evolving. The parent population is manipulated by LLM-based operators to generate offspring. Subsequently, we employ the non-dominated sorting algorithm to rank the combined population and measure the crowding distance of instructions. At the end of each generation, we randomly replace some Pareto-front instructions with new instructions to enhance the diversity of the population (instructions play the role of genes in NSGA-II). We also provide the pseudocode of InstOptima in Appendix A.4.
2.2.1 Operators for Instructions
To handle the non-differentiable text search space, we formulate these operators as a text generation task based on ChatGPT. In other words, we define a set of fixed prompts $P = \{p_{dm}, p_{dc}, p_{em}, p_{ec}\}$ to guide ChatGPT in performing the instruction operations, where $p_{dm}$, $p_{dc}$, $p_{em}$, and $p_{ec}$ are the fixed prompts for the four operations:
- Definition Mutation ($p_{dm}$): This operator mutates the definition in an instruction. It can involve paraphrasing or the substitution of a new definition.
- Definition Crossover ($p_{dc}$): This operator combines the definitions of two instructions to create a new instruction. It can involve merging or exchanging parts of the definitions between the parent instructions.
- Example Mutation ($p_{em}$): This operator perturbs the example to introduce diversity. It can involve modifications such as example substitution, addition, or deletion.
- Example Crossover ($p_{ec}$): This operator randomly selects examples from two instructions to create a new instruction.
For instance, we formulate the definition mutation operation as follows:

$\tilde{d} = \mathrm{LLM}(p_{dm}, I)$   (2)

where $\tilde{d}$ is the new definition generated based on the original instruction $I = (d, e)$. The new instruction is denoted as $\tilde{I} = (\tilde{d}, e)$. The other operators follow a similar formulation to mutation. Further details of the fixed prompts are available in Appendix A.5.
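As an illustration, the definition mutation of Eq. (2) reduces to a single chat-completion call: the fixed prompt $p_{dm}$ is concatenated with the parent instruction and sent to ChatGPT. The sketch below assumes the official openai Python SDK; the prompt string is abridged from Table 4 in Appendix A.5, and the model name and wrapper function are assumptions.

```python
# Sketch of the LLM-simulated definition mutation (Eq. 2). The fixed prompt
# p_dm is abridged from Appendix A.5; model name and wrapper are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

P_DM = ("I want you to be a professional prompt engineer. "
        "Here I give you an example template prompt, please understand "
        "the meaning of the prompt and modify it. ...")  # abridged

def definition_mutation(definition: str, example: str) -> tuple[str, str]:
    """Apply p_dm to instruction I = (d, e); returns the new instruction (d~, e)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{P_DM}\n{definition}"}],
    )
    new_definition = response.choices[0].message.content
    return new_definition, example  # the example part e is kept unchanged
```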
2.2.2 Optimization Objectives
We consider three objectives $F = (f_1, f_2, f_3)$ in optimization, i.e., the metrics ($f_1$), length ($f_2$), and perplexity ($f_3$) of the instruction.
- Performance ($f_1$): We use a set of metrics, such as accuracy, F1 score, precision, and recall, obtained by evaluating the instruction, to calculate the performance objective. The performance objective is represented as the reciprocal of the sum of these metrics.
- Length ($f_2$): The length of the instruction is measured in terms of the number of characters. This measurement is fair regardless of the tokenization strategy.
- Perplexity ($f_3$): The perplexity of the instruction is measured using the RoBERTa model.
The evaluation of objectives is shown in the pseudocode in Appendix A.4 but is not depicted in Fig. 1 for simplicity.
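The following is a minimal sketch of the three objective evaluations under the description above. The performance objective inverts the summed metrics so that all three objectives are minimized; the RoBERTa-based perplexity is implemented here as a masked-LM pseudo-perplexity, which is an assumption, since the paper only states that perplexity is measured using RoBERTa.

```python
# Sketch of the three minimization objectives (f1, f2, f3). The RoBERTa-based
# perplexity is a masked-LM pseudo-perplexity: an assumed reading of the paper.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

def f_performance(metrics: dict) -> float:
    # Reciprocal of the summed metrics (e.g., accuracy, F1, precision, recall),
    # so that maximizing the metrics becomes a minimization objective.
    return 1.0 / max(sum(metrics.values()), 1e-8)

def f_length(instruction: str) -> float:
    # Character count, independent of any tokenization strategy.
    return float(len(instruction))

@torch.no_grad()
def f_perplexity(instruction: str) -> float:
    # Mask each token in turn and accumulate its negative log-likelihood.
    ids = tok(instruction, return_tensors="pt", truncation=True).input_ids[0]
    nll = 0.0
    for i in range(1, len(ids) - 1):  # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        logits = mlm(masked.unsqueeze(0)).logits[0, i]
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return math.exp(nll / max(len(ids) - 2, 1))
```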
2.3 Objective-Guided Instruction Operators
To enhance the performance of ChatGPT through in-context learning, we propose a simple yet effective objective-feedback mechanism. Specifically, we incorporate the fitness values into the fixed prompts. For example, we can append "Please refer to the objective values: $F(I_1)$, $F(I_2)$" to $p_{ec}$ in instruction example crossover. These operators (please refer to Table 4 for their actual implementations) allow ChatGPT to autonomously decide to emphasize or down-weight an instruction based on the current objective values $F$.
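A guided crossover prompt can then be assembled as below; the exact wording and formatting of the appended objective values are assumptions modeled on Table 4.

```python
# Sketch: append the parents' objective values to the fixed crossover prompt
# p_ec so the LLM can weigh the two instructions. Formatting is an assumption.
def build_guided_prompt(p_ec: str, inst1: str, obj1: tuple,
                        inst2: str, obj2: tuple) -> str:
    guidance = (f"Please refer to the minimization objective values "
                f"(performance, length, perplexity): {obj1}, {obj2}")
    return f"{p_ec}\n{guidance}\nInstruction 1: {inst1}\nInstruction 2: {inst2}"

# Example usage with made-up objective values:
# prompt = build_guided_prompt(P_EC, "Classify ...", (0.42, 310.0, 1.08),
#                              "Determine ...", (0.45, 512.0, 1.12))
```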
Table 1: Main results of InstOptima compared with the RanInstruct and NoInstruct baselines.

| Model | Dataset | InstOptima | | | RanInstruct | | | NoInstruct |
| | | Accuracy | Length | Perplexity | Accuracy | Length | Perplexity | Accuracy |
| FlanT5-small | Laptop14 | 84.9 | 622.6 | 1.07 | 82.5 | 740.2 | 1.07 | 53.8 |
| | Restaurant14 | 84.9 | 421.6 | 1.11 | 82.3 | 328.5 | 1.15 | 19.2 |
| | SST2 | 89.7 | 402.7 | 1.09 | 88.7 | 499.7 | 1.16 | 86.9 |
| | AGNews | 90.2 | 452.5 | 1.11 | 82.9 | 560.6 | 1.12 | 74.3 |
| | SNLI | 69.1 | 295.3 | 1.14 | 50.8 | 507.3 | 1.09 | 37.9 |
| | MNLI | 57.4 | 385.8 | 1.12 | 40.6 | 519.7 | 1.09 | 37.3 |
| FlanT5-base | Laptop14 | 88.4 | 207.2 | 1.04 | 86.6 | 549.7 | 1.10 | 62.3 |
| | Restaurant14 | 89.1 | 359.4 | 1.06 | 87.4 | 589.3 | 1.11 | 52.8 |
| | SST2 | 94.5 | 397.8 | 1.08 | 93.0 | 385.6 | 1.12 | 92.6 |
| | AGNews | 93.5 | 300.1 | 1.15 | 90.1 | 485.4 | 1.16 | 88.1 |
| | SNLI | 86.6 | 430.9 | 1.10 | 86.4 | 399.3 | 1.11 | 85.9 |
| | MNLI | 80.2 | 388.2 | 1.11 | 77.8 | 449.1 | 1.20 | 74.5 |
| ChatGPT | Laptop14 | 83.2 | 512.9 | 1.08 | 83.1 | 877.6 | 1.05 | 67.8 |
| | Restaurant14 | 96.3 | 487.3 | 1.09 | 92.1 | 421.6 | 1.10 | 75.2 |
3 Experimental Setup
We conducted a comprehensive set of experiments to validate the performance of InstOptima (to improve reproducibility, we release all experimental materials in the supplementary files of the submission, including source code, experiment logs, results, and optimized instructions). The detailed experimental setups and implementations are described in Appendix A.1.
3.1 Baseline Methods
We used random instruction generation (RanInstruct), i.e., requesting ChatGPT to generate several instructions similar to those generated by InstOptima, and no instruction (NoInstruct) as comparison baselines. RanInstruct generates five random instructions using the LLM and evaluates the same three objectives as InstOptima. NoInstruct ablates the instruction in the classification-oriented fine-tuning of FlanT5.
3.2 Main Results
The results in Table 1 show the performance of InstOptima. Overall, InstOptima achieves superior objective values across various base models (e.g., ChatGPT and FlanT5). For example, it outperforms all baselines on all datasets in terms of Accuracy. However, for instruction Length and Perplexity, RanInstruct sometimes achieves better objective values. On the other hand, NoInstruct performs poorly on all datasets in terms of Accuracy, underscoring the importance of instructions in generation-based fine-tuning. Moreover, the Accuracy objective exhibits small intervals but relatively large variances, making it more challenging to optimize, and existing methods that prioritize performance optimization struggle to handle these variances in metrics. The Length objective, by contrast, is easier to optimize due to its large variations and practical significance: long instructions can result in up to twice the training time of short instructions. The Perplexity metric ranges within small intervals, indicating a moderate optimization challenge, but it significantly affects how easily instructions can be understood. Beyond these three objectives, InstOptima can easily accommodate additional objectives for precise control of instruction generation.
Overall, InstOptima demonstrates impressive performance in instruction optimization across various tasks and datasets.
3.3 Research Questions
We further discuss our observations and analysis by answering several research questions.
RQ1: Do the objective-guided operators help instruction optimization?
Table 2: Ablation results of InstOptima-N (InstOptima without objective guidance) on FlanT5-small.

| Dataset | InstOptima-N | | |
| | Accuracy | Length | Perplexity |
| Laptop14 | 84.4 | 789.3 | 1.07 |
| Restaurant14 | 83.7 | 455.8 | 1.12 |
| SST2 | 89.6 | 435.2 | 1.12 |
| AGNews | 86.7 | 535.8 | 1.26 |
| SNLI | 69.8 | 454.0 | 1.11 |
| MNLI | 57.3 | 465.6 | 1.09 |
To investigate the impact of objective-guided operators on InstOptima, we conducted ablative experiments to assess the performance of InstOptima-N, which eliminates the objective guidance in the operators. The experimental results on FlanT5-small are presented in Table 2. Based on the results in Table 1 and Table 2, it is evident that InstOptima-N achieves inferior objective values on most datasets, particularly in terms of Accuracy and Length. However, for the SNLI dataset, InstOptima-N obtains better results in Accuracy and Perplexity compared to InstOptima. These findings demonstrate the effectiveness of objective-guided operators. Nonetheless, the concept of objective-guided operators is still in its early stages and warrants further investigation in future studies.
In conclusion, the experimental results indicate that objective-guided operators obtain better performance across various datasets.
RQ2: Does the number of evolution generations matter in InstOptima?
Figure 2: Objective values over evolution generations on the Laptop14, SST2, and SNLI datasets.
Generally, a larger number of generations tends to result in better objective values after optimization. We conducted additional training for more generations on the Laptop14, SST2, and SNLI datasets to study the significance of the number of generations. Based on the experimental results in Fig. 2, in most cases (e.g., the Laptop14 and SNLI datasets), we observed a significant trade-off among the three objectives. However, due to the small scale of the evaluation data and population size, there were large variances in the performance objective (see the left column in Fig. 2). These variances in performance interfere with the convergence of the other two objectives, resulting in the absence of clear descending trends for the length and perplexity objectives as the number of generations increases. However, this issue can be addressed by increasing the population size, the number of generations, and the scale of the training data.
In conclusion, given the limited evaluation resources, increasing the number of evolution generations yielded limited improvement. Instead, it is important to reconcile the different objective values to achieve the final instruction population.
RQ3: Are there trade-offs between different objectives?
To analyze the relationships between different objectives, we plot the Pareto-front instructions (refer to Fig. 5) in three pairwise groups. The two-dimensional Pareto fronts between pairwise objectives are presented in Fig. 3.
Figure 3: Two-dimensional Pareto fronts between pairwise objectives.
Overall, there is a clear trade-off between instruction length and perplexity. However, for the performance-length and performance-perplexity pairs, no clear trade-off is observed in Fig. 3. This could be attributed to the lack of strict trade-offs between these objectives and to noisy fitness points caused by evaluating metrics on small datasets during optimization. It is expected that this issue can be mitigated by evaluating performance on larger datasets.
Nevertheless, InstOptima consistently discovers high-quality instructions in most scenarios, regardless of the loose trade-offs between objective pairs such as performance-length and performance-perplexity. This demonstrates the effectiveness of InstOptima in obtaining a diverse set of instructions.
4 Conclusion
We propose a multi-objective instruction optimization framework to obtain a diversified set of instructions. To address the challenges posed by the large and non-differentiable text search space, InstOptima utilizes objective-guided instruction operators based on an LLM, which show impressive performance in instruction generation. However, it is important to note that multi-objective instruction optimization is still in its early stages and requires further research in the future.
5 Limitation
The first limitation of InstOptima lies in the potential risk of local optima in the multi-objective optimization. InstOptima initializes the instruction population from fixed, manually crafted instructions, which are then mutated using the LLM. Although InstOptima has been shown to find diversified and high-quality instructions in our experiments, the reliance on fixed initial instructions may lead to traps in local optima during the multi-objective optimization process. In the future, the generation of initial instruction populations, such as employing randomized initial instructions, remains a topic worth exploring.
The second limitation of InstOptima is related to experimental resources. Due to resource constraints, we only utilized single-round API calls to generate new instructions using the LLM. This approach overlooks the contextual information that could help in understanding objective feedback during instruction generation. We believe that continuous dialogue with the LLM would significantly improve the quality of the instructions it generates. Additionally, due to the difficulty of accessing the LLM, we conducted experiments with smaller population sizes and fewer iterations, which may underestimate the performance of InstOptima.
Acknowledgment
This work was supported in part by the UKRI Future Leaders Fellowship under Grant MR/S017062/1 and MR/X011135/1; in part by NSFC under Grant 62376056 and 62076056; in part by the Royal Society under Grant IES/R2/212077; in part by the EPSRC under Grant 2404317; in part by the Kan Tong Po Fellowship (KTP\R1\231017); and in part by the Amazon Research Award and Alan Turing Fellowship.
References
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642. The Association for Computational Linguistics.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS’20: Proc. of Annual Conference on Neural Information Processing Systems.
- Chen et al. (2023) Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. 2023. Instructzero: Efficient instruction optimization for black-box large language models. CoRR, abs/2306.03082.
- Cho et al. (2023) Sukmin Cho, Soyeong Jeong, Jeongyeon Seo, and Jong C. Park. 2023. Discrete prompt optimization via constrained generation for zero-shot re-ranker. CoRR, abs/2305.13729.
- Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. CoRR, abs/2210.11416.
- Deb et al. (2002) Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput., 6(2):182–197.
- Gu et al. (2022) Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2022. PPT: pre-trained prompt tuning for few-shot learning. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 8410–8423. Association for Computational Linguistics.
- Honovich et al. (2022) Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2022. Instruction induction: From few examples to natural language task descriptions. CoRR, abs/2205.10782.
- Ishibashi et al. (2023) Yoichi Ishibashi, Danushka Bollegala, Katsuhito Sudoh, and Satoshi Nakamura. 2023. Evaluating the robustness of discrete prompts. In EACL’23: Proc. of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2365–2376. Association for Computational Linguistics.
- Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 3045–3059. Association for Computational Linguistics.
- Li et al. (2023) Moxin Li, Wenjie Wang, Fuli Feng, Jizhi Zhang, and Tat-Seng Chua. 2023. Robust instruction optimization for large language models with distribution shifts. CoRR, abs/2305.13954.
- Liu et al. (2023) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9):195:1–195:35.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
- OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
- Pitis et al. (2023) Silviu Pitis, Michael R. Zhang, Andrew Wang, and Jimmy Ba. 2023. Boosted prompt ensembles for large language models. CoRR, abs/2304.05970.
- Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, August 23-24, 2014, pages 27–35. The Association for Computer Linguistics.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1631–1642. ACL.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
- Wan et al. (2023) Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. CoRR, abs/2305.00944.
- Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- Wang et al. (2021) Xiaosen Wang, Yichen Yang, Yihe Deng, and Kun He. 2021. Adversarial training with fast gradient projection method against synonym substitution based text attacks. In AAAI’21: Proc. of Thirty-Fifth AAAI Conference on Artificial Intelligence, pages 13997–14005. AAAI Press.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, pages 38–45. Association for Computational Linguistics.
- Yu et al. (2022) Xiaoyan Yu, Qilei Yin, Zhixin Shi, and Yuru Ma. 2022. Improving the semantic consistency of textual adversarial attacks via prompt. In International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, July 18-23, 2022, pages 1–8. IEEE.
- Zhang et al. (2015) Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.
- Zhou et al. (2021) Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang. 2021. Defense against synonym substitution-based adversarial attacks via dirichlet neighborhood ensemble. In ACL/IJCNLP’21: Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 5482–5492. Association for Computational Linguistics.
- Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. CoRR, abs/2211.01910.
- Zhou et al. (2023) Yuhang Zhou, Suraj Maharjan, and Beiye Liu. 2023. Scalable prompt generation for semi-supervised learning with language models. In EACL’23: Findings of the Association for Computational Linguistics, pages 758–769. Association for Computational Linguistics.
- Zhu et al. (2023) Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, and Xing Xie. 2023. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. CoRR, abs/2306.04528.
Appendix A Appendix
A.1 Experiment Setup
A.1.1 Datasets
We selected six datasets for three classification tasks. For the aspect-based sentiment analysis (ABSA) task, we used the Laptop14 and Restaurant14 datasets Pontiki et al. (2014). For text classification (TC) tasks, we chose the SST2 Socher et al. (2013) and AGNews Zhang et al. (2015) datasets. We selected the SNLI Bowman et al. (2015) and MNLI Wang et al. (2019) datasets for the natural language inference (NLI) task. We trained our models on the first samples from the original training, validation and testing datasets, respectively.
A.1.2 Experimental PLMs
For the LLM that operates on instructions, we select ChatGPT (the ChatGPT-turbo-0301 version) OpenAI (2023), with a temperature of and a maximum token length of .
To obtain the objective value of performance, we performed instruction-based classification experiments using the FlanT5-small and FlanT5-base models Chung et al. (2022), as well as ChatGPT, which are popular recent PLMs/LLMs for instruction learning. For the calculation of semantic complexity (i.e., perplexity), we employed the RoBERTa model Liu et al. (2019) from transformers Wolf et al. (2020).
A.1.3 Hyper-parameter Settings
The generation size and number of generations for NSGA-II are and , respectively. In the fine-tuning of the PLMs (i.e., FlanT5-small and FlanT5-base), we use the Huggingface Trainer, and the code is available in the supplementary materials. We set the learning rate and batch size to and , respectively, and fine-tune the PLMs for epochs with an L2 regularization parameter of .
A.1.4 Experimental Environment
The experiments are carried out on a computer running the CentOS operating system, equipped with an RTX GPU and a Core i processor. We use the PyTorch library and transformers.
A.2 Additional Experiments for Summarization
A.2.1 Generative Text Summarization
We conducted experiments on a text generation task, i.e., generative summarization. To evaluate InstOptima, we used three subsets of the GigaWord dataset and the FlanT5-small model in our experiments. The training subset contains 5k training examples, while the testing and validation subsets each contain 1k examples. According to the Rouge metric, InstOptima performs well on the GigaWord dataset, demonstrating that it is a task-agnostic method for multi-objective instruction optimization.
Table 3: Results of generative summarization on the GigaWord dataset.

| Model | Dataset | InstOptima | | | RanInstruct | | | NoInstruct |
| | | Rouge | Length | Perplexity | Rouge | Length | Perplexity | Rouge |
| FlanT5-small | GigaWord | 33.7 | 586.9 | 1.08 | 32.9 | 891.6 | 1.11 | 30.8 |
A.2.2 Experiments based on Different Backbone Models
We conducted experiments to demonstrate the relationship between the backbone model and performance. Due to resource limitations, we currently use FlanT5 variants (small, base, and large; Llama is not implemented currently) as backbones to implement InstOptima. We present a box plot of the experimental results in Fig. 4.
Figure 4: Box plot of instruction accuracy across different FlanT5 backbone scales.
The figure illustrates that performance is highly dependent on the scale of the backbone instruction-following model. In other words, because the FlanT5-small model has limited capability to follow instructions, the accuracy achieved by an instruction is low and exhibits a larger variance compared to the larger instruction-following models. In this context, InstOptima plays a crucial role in identifying instructions with optimized objectives.
A.3 The Visualization of Pareto-fronts
In Fig. 5, we show visualizations of the Pareto-front instructions obtained by InstOptima. Due to resource limitations, we only present plots for the Laptop14, SST2, and SNLI datasets. We plot the first three fronts found by NSGA-II, indicated by red, green, and blue, respectively.
Figure 5: The first three Pareto fronts of instructions on the Laptop14, SST2, and SNLI datasets (red, green, and blue, respectively).
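The plotting itself is straightforward; below is a self-contained sketch of a Fig. 5-style visualization that colors the first three non-dominated fronts red, green, and blue, using synthetic objective values in place of real evaluated instructions.

```python
# Sketch: scatter the first three non-dominated fronts in objective space,
# colored as in Fig. 5. Objective values here are synthetic stand-ins.
import random
import matplotlib.pyplot as plt

def dominates(a, b):  # minimization: a is no worse everywhere, better somewhere
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def first_fronts(objs, k=3):
    remaining, fronts = set(range(len(objs))), []
    while remaining and len(fronts) < k:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

objs = [(random.random(), random.uniform(200, 800), random.uniform(1.0, 1.3))
        for _ in range(60)]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for front, color in zip(first_fronts(objs), ["red", "green", "blue"]):
    xs, ys, zs = zip(*[objs[i] for i in front])
    ax.scatter(xs, ys, zs, c=color)
ax.set_xlabel("performance")
ax.set_ylabel("length")
ax.set_zlabel("perplexity")
plt.show()
```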
A.4 Multi-objective Optimization Algorithm
InstOptima is a multi-objective instruction optimization approach that evolves a population of instructions through a series of steps. We present the pseudocode of InstOptima in Algorithm 1.
Firstly, the algorithm initializes a population of instructions. Then, for a specified number of generations, it iteratively performs the following steps: selecting two instructions from the population, evaluating their objectives, applying LLM-based instruction operators to create new instructions, and adding them to a temporary population. After each generation, the temporary population is combined with the original population, and a selection process is applied to choose the fittest instructions. Finally, the algorithm returns the evolved population of instructions as the final result.
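Algorithm 1 itself is not reproduced in this extract, so the following compact Python sketch restates the loop described above. Here, llm_operator and evaluate_objectives are placeholders for the operators of Section 2.2.1 and the objectives of Section 2.2.2, and the random replacement of Pareto-front instructions is omitted for brevity.

```python
# Compact sketch of the InstOptima evolution loop (Algorithm 1). Helpers
# llm_operator() and evaluate_objectives() are placeholders; selection is
# plain NSGA-II-style non-dominated sorting with crowding distance.
import random

def dominates(a, b):
    # a dominates b if a is no worse in all objectives and better in one (minimization).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_fronts(objs):
    remaining, fronts = set(range(len(objs))), []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(objs, front):
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        span = objs[ordered[-1]][m] - objs[ordered[0]][m] or 1.0
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        for k in range(1, len(ordered) - 1):
            dist[ordered[k]] += (objs[ordered[k + 1]][m] - objs[ordered[k - 1]][m]) / span
    return dist

def evolve(init_instructions, n_generations, pop_size, llm_operator, evaluate_objectives):
    population = list(init_instructions)
    for _ in range(n_generations):
        # Select pairs of parents and create offspring via LLM-based operators.
        offspring = [llm_operator(*random.sample(population, 2)) for _ in range(pop_size)]
        combined = population + offspring
        objs = [evaluate_objectives(ind) for ind in combined]
        # NSGA-II-style survival: fill by front rank, break ties by crowding.
        selected = []
        for front in nondominated_fronts(objs):
            if len(selected) + len(front) <= pop_size:
                selected.extend(front)
            else:
                dist = crowding_distance(objs, front)
                front.sort(key=lambda i: dist[i], reverse=True)
                selected.extend(front[:pop_size - len(selected)])
                break
        population = [combined[i] for i in selected]
    return population
```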
A.5 Fixed Prompts for Instruction Operators
The sentences shown in green (i.e., those referring to the minimization objectives) are the triggers of objective-guided instruction generation.
Table 4: Fixed prompts for the LLM-based instruction operators.

| Operator | Prompt | Input |
| $p_{dm}$ (Definition Mutation) | I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization, and I need your help to design and optimize the template prompt. Here I give you an example template prompt, please understand the meaning of the prompt and modify it. Given the minimization objectives, please be creative and output the paraphrased or mutated prompt. Please remove Minimization objectives in the output: <Input> | ($I$) |
| $p_{dc}$ (Definition Crossover) | I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization for sentiment analysis, and I need your help to design and optimize the template prompt. Here I give you two template prompts, please understand the meaning of the two prompts and crossover them into a new prompt. Given the minimization objectives, please be creative and output the generated new prompt based on the two examples. Please remove Minimization objectives in the output: <Input> | ($I_1$, $I_2$) |
| $p_{em}$ (Example Mutation) | I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization for sentiment analysis, and I need your help to design and optimize the template prompt. Here I give you two groups of examples for completing the prompt, please generate new examples to substitute the following examples and there are no more than two examples in the new prompt. Given the minimization objectives, please be creative and output the generated example in the same format. Please remove Minimization objectives in the output: <Input> | ($I$) |
| $p_{ec}$ (Example Crossover) | I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization for sentiment analysis, and I need your help to design and optimize the template prompt. Here I give you two groups of examples for completing the prompt, please read the examples of the two groups of examples and crossover the examples into a new example group and there are no more than two examples in the new examples. Given the minimization objectives, please be creative and output the crossovered the examples. Please remove Minimization objectives in the output: <Input> | ($I_1$, $I_2$) |