
InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators

Heng Yang¹, Ke Li¹

¹Department of Computer Science, University of Exeter, EX4 4QF, Exeter, UK

{hy345, k.li}@exeter.ac.uk

Abstract

Instruction-based language modeling has received significant attention in pretrained language models. However, the efficiency of instruction engineering remains low and hinders the development of instruction studies. Recent studies have focused on automating instruction generation, but they primarily aim to improve performance without considering other crucial objectives that impact instruction quality, such as instruction length and perplexity. Therefore, we propose a novel approach (i.e., InstOptima) that treats instruction generation as an evolutionary multi-objective optimization problem. In contrast to text editing-based methods, our approach utilizes a large language model (LLM) to simulate instruction operators, including mutation and crossover. Furthermore, we introduce an objective-guided mechanism for these operators, allowing the LLM to comprehend the objectives and enhance the quality of the generated instructions. Experimental results demonstrate improved fine-tuning performance and the generation of a diverse set of high-quality instructions.

1 Introduction

With the rapid development of language models Ouyang et al. (2022); Touvron et al. (2023); OpenAI (2023), instructions (also known as prompts) play a crucial role in instruction-based language modeling, and different instructions may lead to significant differences in model outputs Zhou et al. (2022); Honovich et al. (2022); Wan et al. (2023). For instance, even slightly perturbed instructions (e.g., synonym substitutions Wang et al. (2021); Zhou et al. (2021) or adversarial attacks Wan et al. (2023); Zhu et al. (2023)) can result in unexpectedly low performance. However, there are three problems regarding instruction-based learning that still need to be addressed in existing works.

Firstly, existing works Lester et al. (2021); Gu et al. (2022); Zhou et al. (2022, 2023); Li et al. (2023); Chen et al. (2023) aim to obtain a large number of instructions through automated instruction generation to filter high-performance instructions. However, due to the large and non-differentiable textual search space Ishibashi et al. (2023); Cho et al. (2023), the automated instruction generation and instruction engineering methods Brown et al. (2020); Liu et al. (2023) are inefficient and struggle to search for various high-quality instructions. Secondly, the objectives of instruction generation are not clear. Current research Lester et al. (2021); Gu et al. (2022); Pitis et al. (2023) regards performance (i.e., metrics) as the sole criterion for instruction quality. However, model performance alone cannot precisely explain instruction quality. We propose to refine instruction quality by considering fine-grained objectives, such as length and perplexity. Shorter instructions can lower computational costs, especially for large-scale models and datasets. Lower perplexity indicates that instructions are more easily understood by language models. Lastly, the diversity of instructions has been neglected in existing studies, while increasing the diversity of instructions can mitigate adversarial attacks Wan et al. (2023); Zhu et al. (2023) and improve instruction robustness Yu et al. (2022); Zhu et al. (2023). We aim to obtain multiple alternative instructions based on multi-objective optimization, which can facilitate comprehensive evaluation of instructions.

To address these three problems, we formulate the task as an evolutionary multi-objective optimization problem and propose our framework called InstOptima. We leverage a large language model, specifically ChatGPT OpenAI (2023), to facilitate instruction operations such as mutation and crossover. Furthermore, we introduce an objective-guided mechanism to assist the language model in generating high-quality instructions. In terms of optimization objectives for instruction generation, InstOptima incorporates three objectives: performance (metrics), length, and perplexity, enabling the exploration of a diverse and high-quality set of instructions. We adopt NSGA-II Deb et al. (2002) in InstOptima to obtain a Pareto front of instruction sets.

To validate the efficacy of InstOptima, we conducted experiments on three generation-based classification tasks. The experimental results indicate that InstOptima can concurrently obtain a diverse set of instructions that outperform their counterparts in terms of performance.

In summary, our contributions are as follows:

  • We simulate instruction operators based on an LLM. We also show that the objective-guided operators help the LLM understand optimization objective values and improve instruction quality.


  • We divide the orientation of instruction search into multiple objectives, such as performance, length, and perplexity, facilitating fine-grained control over instruction quality.


  • We utilize a multi-objective optimization algorithm to automatically search for a set of high-quality instructions, which could help defend against adversarial attacks and improve instruction robustness.



The codes are available at: https://github.com/yangheng95/InstOptima.

Figure 1: The main framework of InstOptima (left) and instruction operation examples (right). The workflow is detailed in Section 2.2. The population is composed of individual instructions.

2 Proposed Method

In this section, we first introduce instruction-based text generation, followed by the details of InstOptima.

2.1 Instruction-based Generation

In text generation-based tasks¹, instructions are utilized to facilitate in-context learning Brown et al. (2020) and improve language modeling. An instruction (depicted in the right part of Fig. 1) is represented as $\mathbf{I} = \mathrm{Concat}(\mathbf{d}, \mathbf{e})$, where $\mathbf{d}$ and $\mathbf{e}$ are the definition and example of the target task, respectively. $\mathbf{d}$ and $\mathbf{e}$ are token sequences similar to $(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}$, where $\mathbf{x}$, $\mathbf{y}$, and $\mathcal{D}$ denote the input, output, and task dataset, respectively. The modeling of a generation model $f(\cdot, \cdot)$ is defined as follows:

¹We validate InstOptima on generation-based text classification, and InstOptima can be easily applied to other instruction-based modeling tasks.

$\hat{\mathbf{y}} = f(\mathbf{x}, \mathbf{I})$ (1)

where $\hat{\mathbf{y}}$ represents the generated output given $\mathbf{x}$ and $\mathbf{I}$. In InstOptima, we aim to address the problem of automated instruction generation through multi-objective optimization.
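This formulation can be sketched in a few lines; the `generate` callable below is a placeholder standing in for the generation model $f$, and the definition/example strings are illustrative, not taken from the paper:

```python
def concat_instruction(definition: str, example: str) -> str:
    # I = Concat(d, e): join the task definition and a task example into one instruction
    return f"{definition}\n{example}"

def predict(generate, x: str, instruction: str) -> str:
    # y_hat = f(x, I): the model generates the output conditioned on the input and instruction
    return generate(f"{instruction}\n{x}")

# Usage with a stub model in place of f:
definition = "Classify the sentiment of the sentence as positive or negative."
example = "Input: great movie! Output: positive"
instruction = concat_instruction(definition, example)
output = predict(lambda prompt: "positive", "I loved it.", instruction)
```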

2.2 Evolutionary Instruction Optimization

The workflow of InstOptima is illustrated in Fig. 1. We begin by initializing a parent population of instructions to start evolving. The parent population is manipulated by LLM-based operators to generate offspring. Subsequently, we employ the non-dominated sorting algorithm to rank the combined population and measure the crowding distance of instructions. At the end of each generation, we randomly replace some Pareto-front instructions with new instructions to enhance the diversity of the population (instructions play the role of genes in NSGA-II). We also provide the pseudo-code of InstOptima in Appendix A.4.
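The evolution loop can be sketched as follows. This is a simplified NSGA-II-style selection, assuming `mutate`, `crossover`, and `evaluate` are supplied externally (in InstOptima they are the LLM operators and objective evaluation); crowding-distance tie-breaking is omitted for brevity:

```python
import random

def dominates(a, b):
    # a dominates b if a is no worse in every objective and strictly better in one (minimization)
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(population, fitness):
    # Group population indices into Pareto fronts; front 0 is the best (non-dominated) set.
    fronts, remaining = [], set(range(len(population)))
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(fitness[j], fitness[i]) for j in remaining if j != i)}
        fronts.append(sorted(front))
        remaining -= front
    return fronts

def evolve(population, evaluate, mutate, crossover, generations=10, new_rate=0.2):
    # Each generation: operators produce offspring, non-dominated sorting selects
    # survivors, and a fraction of survivors is replaced to maintain diversity.
    for _ in range(generations):
        offspring = [mutate(random.choice(population)) for _ in population]
        offspring += [crossover(*random.sample(population, 2)) for _ in population]
        combined = population + offspring
        fitness = [evaluate(ind) for ind in combined]
        fronts = non_dominated_sort(combined, fitness)
        survivors = [combined[i] for front in fronts for i in front][: len(population)]
        for i in range(len(survivors)):
            if random.random() < new_rate:
                survivors[i] = mutate(random.choice(survivors))
        population = survivors
    return population
```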

2.2.1 Operators for Instructions

To handle the non-differentiable text search space, we formulate these operators as a text generation task based on ChatGPT. In other words, we define a set of fixed prompts $\tilde{\mathbf{P}} = \{\tilde{P}_{dm}, \tilde{P}_{dc}, \tilde{P}_{em}, \tilde{P}_{ec}\}$ to guide ChatGPT in performing the instruction operations, where $\tilde{P}_{dm}$, $\tilde{P}_{dc}$, $\tilde{P}_{em}$, and $\tilde{P}_{ec}$ are the fixed prompts for the four operators:

  • Definition Mutation ($\tilde{P}_{dm}$): This operator mutates the definition in an instruction. It can involve paraphrasing or substituting a new definition.

  • Definition Crossover ($\tilde{P}_{dc}$): This operator combines the definitions of two instructions to create a new instruction. It can involve merging or exchanging parts of the definitions between the parent instructions.

  • Example Mutation ($\tilde{P}_{em}$): This operator perturbs the example to introduce diversity. It can involve modifications such as example substitution, addition, or deletion.

  • Example Crossover ($\tilde{P}_{ec}$): This operator randomly selects examples from two instructions to create a new instruction.

For instance, we formulate the mutation operation as follows:

$\hat{\mathbf{d}}_{dm} = \texttt{ChatGPT}(\mathrm{Concat}(\tilde{P}_{dm}, \mathbf{d}))$ (2)

where $\hat{\mathbf{d}}_{dm}$ is the new definition generated based on the original instruction $\mathbf{I}$. The new instruction is denoted as $\hat{\mathbf{I}} = \mathrm{Concat}(\hat{\mathbf{d}}_{dm}, \mathbf{e})$. The other operators follow a similar formulation to mutation. Further details of the fixed prompts are available in Appendix A.5.
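A definition-mutation operator of this kind can be sketched as below. The fixed prompt text and the `chat` callable are illustrative assumptions, not the paper's actual prompts or API client (the real prompts are in its Appendix A.5):

```python
# Illustrative fixed prompt P_dm for definition mutation (hypothetical wording).
P_DM = ("Paraphrase or rewrite the following task definition while "
        "preserving its meaning. Return only the new definition.\n")

def definition_mutation(chat, definition: str, example: str) -> str:
    # d_hat = ChatGPT(Concat(P_dm, d)); the new instruction is Concat(d_hat, e).
    new_definition = chat(P_DM + definition)
    return f"{new_definition}\n{example}"

# Usage with a stub in place of the ChatGPT API:
stub_chat = lambda prompt: "Determine whether the sentence expresses positive or negative sentiment."
new_instruction = definition_mutation(stub_chat,
                                      "Classify the sentiment.",
                                      "Input: fine. Output: positive")
```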

2.2.2 Optimization Objectives

We consider three objectives $\mathcal{F} = (m, l, r)$ in optimization, i.e., the metrics ($m$), length ($l$), and perplexity ($r$) of the instruction.

  • Performance: We use a set of metrics, such as accuracy, F1 score, precision, and recall, obtained by evaluating the instruction to calculate the performance objective. The performance objective is represented as the reciprocal of the sum of these metrics.

  • Length: The length of the instruction is measured in terms of the number of characters. This measurement is fair regardless of the tokenization strategy.

  • Perplexity: The perplexity of the instruction is measured using the RoBERTa model.

The evaluation of the objectives $\mathcal{F}$ is shown in the pseudo-code in Appendix A.4 but not depicted in Fig. 1 for simplicity.
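The three objectives might be computed as in the sketch below, all cast as minimization. The perplexity scorer is passed in as a placeholder, since the paper measures it with a RoBERTa-based model; the metric values in the usage example are made up:

```python
def objectives(instruction: str, metrics: dict, perplexity_fn) -> tuple:
    """Return F = (m, l, r), all to be minimized.

    m: reciprocal of the sum of evaluation metrics (better metrics -> smaller m)
    l: instruction length in characters (tokenizer-agnostic)
    r: instruction perplexity under an external scorer
    """
    m = 1.0 / sum(metrics.values())
    l = len(instruction)
    r = perplexity_fn(instruction)
    return (m, l, r)

# Example with a constant placeholder in place of the RoBERTa perplexity model:
F = objectives("Classify the sentiment.",
               {"accuracy": 0.9, "f1": 0.88, "precision": 0.9, "recall": 0.86},
               lambda s: 1.1)
```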

2.3 Objective-Guided Instruction Operators

To enhance the performance of ChatGPT through in-context learning, we propose a simple yet effective objective-feedback mechanism. Specifically, we incorporate the fitness values $\mathcal{F} = (m, l, r)$ into the fixed prompts. For example, we can append "Please refer to the objective values: $(\mathbf{d}_1, \mathcal{F}_1)$, $(\mathbf{d}_2, \mathcal{F}_2)$" to $\tilde{P}_{dc}$ in instruction definition crossover. These operators² allow ChatGPT to autonomously decide to emphasize or down-weight an instruction based on the current objectives $\mathcal{F}$.

²Please refer to Table 4 for the actual implementations of these objective-guided operators.
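Appending the objective values to a fixed prompt might look like the following sketch; the wording and the example values are illustrative, and the paper's Table 4 gives the actual implementations:

```python
def objective_guided_prompt(fixed_prompt: str, parents: list) -> str:
    """Append (definition, F) pairs so the LLM can weigh parents by their objectives.

    parents: list of (definition, (m, l, r)) tuples for the crossover parents.
    """
    feedback = ", ".join(f"({d}, {f})" for d, f in parents)
    return f"{fixed_prompt}\nPlease refer to the objective values: {feedback}"

# Hypothetical crossover prompt with two parent definitions and their objectives:
prompt = objective_guided_prompt(
    "Combine the two task definitions into one new definition.",
    [("Classify sentiment.", (0.28, 23, 1.1)),
     ("Judge the polarity.", (0.30, 19, 1.2))],
)
```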

Table 1: The experimental performance of InstOptima. We show the Accuracy instead of the performance objective for intuitive evaluation. The symbols '↗' and '↘' indicate 'larger is better' and 'lower is better', respectively. We repeat each experiment in five rounds and report the average results. The best results are in bold. The Accuracy is the best accuracy in the Pareto front, while the Length and Perplexity correspond to the instruction that achieves the best accuracy.

| Model | Dataset | InstOptima Accuracy↗ | InstOptima Length↘ | InstOptima Perplexity↘ | RanInstruct Accuracy↗ | RanInstruct Length↘ | RanInstruct Perplexity↘ | NoInstruct Accuracy↗ |
|---|---|---|---|---|---|---|---|---|
| FlanT5-small | Laptop14 | **84.9±0.2** | **622.6±51.5** | 1.07±0.02 | 82.5±0.3 | 740.2±84.6 | 1.07±0.05 | 53.8±0.3 |
| FlanT5-small | Restaurant14 | **84.9±0.2** | 421.6±82.4 | **1.11±0.01** | 82.3±0.4 | **328.5±38.5** | 1.15±0.03 | 19.2±0.4 |
| FlanT5-small | SST2 | **89.7±0.1** | **402.7±39.1** | **1.09±0.01** | 88.7±0.5 | 499.7±73.2 | 1.16±0.02 | 86.9±0.1 |
| FlanT5-small | AGNews | **90.2±0.1** | **452.5±27.7** | **1.11±0.04** | 82.9±0.6 | 560.6±28.7 | 1.12±0.04 | 74.3±0.1 |
| FlanT5-small | SNLI | **69.1±0.2** | **295.3±74.8** | 1.14±0.02 | 50.8±0.5 | 507.3±98.0 | **1.09±0.07** | 37.9±0.2 |
| FlanT5-small | MNLI | **57.4±0.3** | **385.8±57.5** | 1.12±0.03 | 40.6±1.1 | 519.7±68.6 | **1.09±0.05** | 37.3±0.3 |
| FlanT5-base | Laptop14 | **88.4±0.3** | **207.2±57.3** | **1.04±0.04** | 86.6±0.3 | 549.7±85.7 | 1.10±0.03 | 62.3±0.2 |
| FlanT5-base | Restaurant14 | **89.1±0.2** | **359.4±39.7** | **1.06±0.03** | 87.4±0.5 | 589.3±63.2 | 1.11±0.03 | 52.8±0.2 |
| FlanT5-base | SST2 | **94.5±0.1** | 397.8±69.4 | **1.08±0.01** | 93.0±0.4 | **385.6±55.0** | 1.12±0.01 | 92.6±0.1 |
| FlanT5-base | AGNews | **93.5±0.3** | **300.1±73.8** | **1.15±0.01** | 90.1±0.6 | 485.4±68.2 | 1.16±0.02 | 88.1±0.1 |
| FlanT5-base | SNLI | **86.6±0.3** | 430.9±82.2 | **1.10±0.02** | 86.4±0.5 | **399.3±23.8** | 1.11±0.04 | 85.9±0.3 |
| FlanT5-base | MNLI | **80.2±0.4** | **388.2±58.8** | **1.11±0.03** | 77.8±0.7 | 449.1±70.3 | 1.20±0.03 | 74.5±0.4 |
| ChatGPT | Laptop14 | **83.2±2.2** | **512.9±51.5** | 1.08±0.02 | 83.1±0.8 | 877.6±51.5 | **1.05±0.03** | 67.8±5.8 |
| ChatGPT | Restaurant14 | **96.3±1.9** | 487.3±55.9 | **1.09±0.02** | 92.1±1.3 | **421.6±82.4** | 1.10±0.02 | 75.2±6.1 |

3 Experimental Setup

We conducted a comprehensive set of experiments³ to validate the performance of InstOptima. The detailed experimental setups and implementations are described in Appendix A.1.

³To improve reproducibility, we release all experimental materials in the supplementary files of the submission, including source code, experiment logs, results, and optimized instructions.

3.1 Baseline Methods

We used random instruction generation (RanInstruct) (i.e., requesting ChatGPT to generate several instructions similar to those produced by InstOptima) and no-instruction (NoInstruct) as comparison baselines. RanInstruct generates five random instructions using the LLM, evaluated on the same three objectives as InstOptima. NoInstruct ablates the instruction in the classification-oriented fine-tuning of Flan-T5.

3.2 Main Results

The results in Table 1 show the performance of InstOptima. Overall, InstOptima achieves superior objective values with various base models (e.g., ChatGPT and FlanT5). For example, it outperforms all baselines on all datasets in terms of Accuracy. However, for instruction Length and Perplexity, RanInstruct sometimes achieves better objective values. On the other hand, NoInstruct performs poorly on all datasets in terms of Accuracy, underscoring the importance of instructions in generation-based fine-tuning. Moreover, the Accuracy objective exhibits small intervals but relatively large variances, making it more challenging to optimize; existing methods that prioritize performance optimization struggle to handle these variances in metrics. The Length objective, in contrast, is easier to optimize due to its significant variations and greater practical impact: long instructions can result in up to twice the training time of short instructions. The Perplexity metric ranges within small intervals, indicating a moderate optimization challenge, but it significantly affects how easily instruction engineers can understand the instructions. Beyond these three objectives, InstOptima can easily accommodate additional objectives for precise control of instruction generation.

Overall, InstOptima demonstrates impressive performance in instruction optimization across various tasks and datasets.

3.3 Research Questions

We further discuss our observations and analysis by answering several research questions.

RQ1: Do the objective-guided operators help instruction optimization?

Table 2: The experimental performance of InstOptima-N on FlanT5-small. The symbols '−' and '+' indicate worse and better objectives than InstOptima, respectively.

| Dataset | Accuracy↗ | Length↘ | Perplexity↘ |
|---|---|---|---|
| Laptop14 | 84.4±0.2 − | 789.3±86.2 − | 1.07±0.02 |
| Restaurant14 | 83.7±0.3 − | 455.8±79.9 − | 1.12±0.03 − |
| SST2 | 89.6±0.1 − | 435.2±52.1 − | 1.12±0.02 − |
| AGNews | 86.7±0.8 − | 535.8±69.4 − | 1.26±0.12 − |
| SNLI | 69.8±0.6 + | 454.0±77.0 − | 1.11±0.03 + |
| MNLI | 57.3±0.5 − | 465.6±98.3 − | 1.09±0.02 + |

To investigate the impact of objective-guided operators on InstOptima, we conducted ablation experiments to assess the performance of InstOptima-N, which removes the objective guidance from the operators. The experimental results on FlanT5-small are presented in Table 2. Based on the results in Table 1 and Table 2, it is evident that InstOptima-N achieves inferior objective values on most datasets, particularly in terms of Accuracy and Length. However, on the SNLI dataset, InstOptima-N obtains better Accuracy and Perplexity than InstOptima. Overall, these findings demonstrate the effectiveness of objective-guided operators. Nonetheless, the concept of objective-guided operators is still in its early stages and warrants further investigation in future studies.

In conclusion, the experimental results indicate that the objective-guided operators achieve better objective values across various datasets.

RQ2: Does the number of evolution generations matter in InstOptima?

Figure 2: The trajectory plots of objective values across different datasets. We plot the trajectories of 10 additional generations using red lines. In these figures, lower objective values indicate better performance.

Generally, a larger number of generations tends to result in better objective values after optimization. We conducted additional training for 10 generations on the Laptop14, SST2, and SNLI datasets to study the significance of the number of generations. Based on the experimental results in Fig. 2, in most cases (e.g., the Laptop14 and SNLI datasets), we observed a significant trade-off among the three objectives. However, due to the small scale of the evaluation data and population size, there were large variances in the performance objective (see the left column in Fig. 2). These variances in performance interfere with the convergence of the other two objectives, resulting in the absence of clear descending trends for the length and perplexity objectives as the number of generations increases. However, this issue can be addressed by increasing the population size, the number of generations, and the scale of the training data.

In conclusion, given the limited evaluation resources, the number of evolution generations showed limited improvement. Instead, it is important to reconcile different objective values to achieve the final instruction population.

RQ3: Are there trade-offs between different objectives?

To analyze the relationships between different objectives, we project the Pareto front of instructions (refer to Fig. 5) onto three pairwise groups of objectives. The two-dimensional Pareto fronts between pairwise objectives are presented in Fig. 3.

Figure 3: Visualizations of the 2D Pareto fronts searched by InstOptima on three datasets. The three columns from left to right show the results on the Laptop14, SST2, and SNLI datasets, respectively.

Overall, there is a clear trade-off between instruction length and perplexity. However, for the performance-length and performance-perplexity pairs, no clear trade-off is observed in Fig. 3. This could be attributed to the lack of strict trade-offs between these objectives and to noisy fitness evaluations caused by measuring the metrics on small datasets during optimization. This issue is expected to be mitigated when performance is evaluated on larger datasets.

Nevertheless, InstOptima consistently discovers high-quality instructions in most scenarios, regardless of the loose trade-offs between objective pairs such as performance-length and performance-perplexity. This demonstrates the effectiveness of InstOptima in obtaining a diverse set of instructions.
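The pairwise fronts analyzed here can be obtained by filtering the evaluated instructions with the standard Pareto-dominance test. A minimal sketch, assuming all three objectives are cast as minimization (accuracy negated) and using made-up objective triples:

```python
def dominates(a, b):
    """True if point a dominates b: no worse in every objective, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective tuples (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical objective triples: (negated accuracy, length, perplexity)
population = [(-84.4, 789.3, 1.07), (-83.7, 455.8, 1.12), (-80.0, 900.0, 1.30)]
front = pareto_front(population)  # the third point is dominated by the first
```

Projecting `front` onto any two of the three coordinates yields the pairwise fronts shown in Fig. 3.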

4 Conclusion

We propose a multi-objective instruction optimization framework to obtain a diversified set of instructions. To address the challenges posed by the large and non-differentiable text search space, InstOptima utilizes objective-guided instruction operators based on LLM, which shows impressive performance in instruction generation. However, it is important to note that multi-objective instruction optimization is still in the early stages and requires further research in the future.

5 Limitation

The first limitation of InstOptima lies in the risk of local optima in the multi-objective optimization. InstOptima initializes the instruction population from fixed, manually crafted instructions, which are then mutated using the LLM. Although InstOptima has been shown to search for diversified and high-quality instructions in our experiments, the reliance on fixed initial instructions may trap the multi-objective search in local optima. In the future, the generation of initial instruction populations, e.g., employing randomized initial instructions, remains a topic worth exploring.

The second limitation of InstOptima relates to experimental resources. Due to resource constraints, we only used single-round API calls to generate new instructions with the LLM. This approach overlooks the contextual information that could help the LLM understand objective feedback during instruction generation. We believe that continuous dialogue with the LLM would significantly improve the quality of the instructions it generates. Additionally, due to the difficulty of accessing the LLM, we conducted experiments with smaller population sizes and fewer iterations, which may underestimate the performance of InstOptima.

Acknowledgment

This work was supported in part by the UKRI Future Leaders Fellowship under Grant MR/S017062/1 and MR/X011135/1; in part by NSFC under Grant 62376056 and 62076056; in part by the Royal Society under Grant IES/R2/212077; in part by the EPSRC under Grant 2404317; in part by the Kan Tong Po Fellowship (KTP\R1\231017); and in part by the Amazon Research Award and Alan Turing Fellowship.

References

Appendix A Appendix

A.1 Experiment Setup

A.1.1 Datasets

We selected six datasets for three classification tasks. For the aspect-based sentiment analysis (ABSA) task, we used the Laptop14 and Restaurant14 datasets Pontiki et al. (2014). For text classification (TC) tasks, we chose the SST2 Socher et al. (2013) and AGNews Zhang et al. (2015) datasets. We selected the SNLI Bowman et al. (2015) and MNLI Wang et al. (2019) datasets for the natural language inference (NLI) task. We trained our models on the first 1000 samples from the original training, validation, and testing sets, respectively.

A.1.2 Experimental PLMs

For the LLM used to operate on instructions, we select ChatGPT (the ChatGPT-turbo-0301 version) OpenAI (2023), with a temperature of 1 and a maximum token length of 500.

To obtain the objective value of performance, we performed instruction-based classification experiments using the FlanT5-small and FlanT5-base models Chung et al. (2022), as well as ChatGPT, which are recent and popular PLMs/LLMs for instruction learning. For the calculation of semantic complexity, we employed the RoBERTa Liu et al. (2019) model from transformers Wolf et al. (2020).
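The perplexity objective reduces to the exponential of the average negative log-likelihood over the instruction tokens. A minimal sketch of that computation, assuming the per-token log-probabilities have already been extracted from the language model (the numbers below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood over tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities for a short, fluent instruction
logprobs = [-0.05, -0.10, -0.02, -0.15]
ppl = perplexity(logprobs)  # close to 1.0, matching the low values in Table 2
```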

A.1.3 Hyper-parameter Settings

The generation (population) size and the number of generations for NSGA-II are 100 and 10, respectively. In the fine-tuning (we use the Huggingface Trainer, and the code is available in the supplementary materials) of the PLMs (i.e., FlanT5-small and FlanT5-base), we set the learning rate and batch size to 5e-5 and 16, respectively. We fine-tune the PLMs for 3 epochs with an L2 regularization parameter of 0.01.
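As a sketch, the fine-tuning setup above roughly corresponds to the following Huggingface Trainer configuration (the output_dir value is a placeholder; this is not the exact configuration from the supplementary materials):

```python
from transformers import TrainingArguments

# Hyper-parameters from the text; output_dir is a hypothetical path.
training_args = TrainingArguments(
    output_dir="./flant5-instoptima",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,  # the L2 regularization parameter
)
```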

A.1.4 Experimental Environment

The experiments are carried out on a computer running the CentOS 7 operating system, equipped with an RTX 3090 GPU and a Core i-12900k processor. We use the PyTorch 2.0.0 library and transformers 4.28.0.

A.2 Additional Experiments for Summarization

A.2.1 Generative Text Summarization

We conducted experiments on a text generation task, i.e., generative summarization. To evaluate InstOptima, we used three subsets of the GigaWord dataset and the FlanT5-small model in our experiments. In these subsets, the training set contains 5k examples, while the testing set and validation set each contain 1k examples. According to the Rouge-1 metric, InstOptima performs well on the GigaWord dataset, demonstrating that it is a task-agnostic method for multi-objective instruction optimization.
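Rouge-1 measures unigram overlap between a generated summary and a reference. A minimal pure-Python sketch of the F1 variant (whitespace tokenization only, ignoring the stemming and normalization steps of standard Rouge implementations):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 of 3 candidate tokens match 3 of 4 reference tokens
score = rouge1_f1("police arrest protesters", "police arrest several protesters")
```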

Table 3: The experimental performance of InstOptima on the summarization task. We show the Accuracy instead of the performance objective for intuitive evaluation. The symbols ↗ and ↘ indicate larger is better and lower is better, respectively. We repeat each experiment in five rounds and report the average results. The best results are in bold. The Accuracy is the best accuracy in the Pareto front, while the Length and Perplexity correspond to the instruction that achieves the best accuracy.

Model: FlanT5-small, Dataset: GigaWord
Method        Accuracy (↗)   Length (↘)       Perplexity (↘)
InstOptima    33.7 ±0.3      586.9 ±91.5      1.08 ±0.02
RanInstruct   32.9 ±1.9      891.6 ±151.5     1.11 ±0.03
NoInstruct    30.8 ±0.8      –                –

A.2.2 Experiments based on Different Backbone Models

We have conducted experiments to demonstrate the relationship between the backbone model and performance. Due to resource limitations, we currently use FlanT5 variants (small, base, and large; Llama is not implemented at present) as backbones for InstOptima. We present a box plot of the experimental results in Fig. 4.

Figure 4: Box plot visualizations of the performance based on different backbone models.

The figure illustrates that performance is highly dependent on the scale of the backbone instruction-following model. In other words, because the FlanT5-small model has limited capability to follow instructions, the accuracy achieved by an instruction is low and exhibits a larger variance compared to the larger instruction-following models. In this context, InstOptima plays a crucial role in identifying instructions with optimized objectives.

A.3 The Visualization of Pareto-fronts

In Fig. 5, we show the visualizations of the Pareto-front instructions obtained by InstOptima. Due to resource limitations, we only present the plots for the Laptop14, SST2, and SNLI datasets. We plot the first three fronts searched by NSGA-II, indicated by red, green, and blue colors, respectively.

Figure 5: Visualizations of the Pareto fronts searched by InstOptima on three datasets. The PLM used to evaluate performance is FlanT5-small.

A.4 Multi-objective Optimization Algorithm

InstOptima is a multi-objective instruction optimization approach that evolves a population of instructions through a series of steps. We present the pseudo-code of InstOptima in Algorithm 1.

Firstly, the algorithm initializes a population of instructions. Then, it iteratively performs the following steps for a specified number of generations: selecting two instructions from the population, evaluating their objectives, applying LLM-based instruction operators to create new instructions, and adding them to a temporary population. After each generation, the temporary population is combined with the original population, and a selection process is applied to choose the fittest instructions. Finally, the algorithm returns the evolved population of instructions as the final results.

Input: Task dataset D; number of generations N; population size M; instruction operators P̃
Output: Evolved population of instructions P*

1:  P ← InitializePopulation(M)                            // Initialize the population
2:  for i ← 1 to N do
3:      Q ← ∅                                              // Initialize the offspring population
4:      for j ← 1 to M do
5:          I_1 ← P_j                                      // Select parent instruction
6:          I_2 ← random(P)                                // Select random parent instruction
7:          F_1 ← EvaluateObjectives(I_1)                  // Evaluate objectives for parent 1
8:          F_2 ← EvaluateObjectives(I_2)                  // Evaluate objectives for parent 2
9:          (d_1, e_1) ← I_1                               // Extract definition and example from parent 1
10:         (d_2, e_2) ← I_2                               // Extract definition and example from parent 2
11:         O ← random(P̃)                                  // Select a random operator
12:         if O == P̃_dm then
13:             d̂_dm ← ChatGPT(Concat(P̃_dm, d_1, F_1))     // Generate mutated definition
14:             Î ← Concat(d̂_dm, e_1)                      // Combine mutated definition with example
15:         if O == P̃_dc then
16:             d̂_dc ← ChatGPT(Concat(P̃_dc, d_1, F_1, d_2, F_2))  // Generate crossover definition
17:             Î ← Concat(d̂_dc, e_1)                      // Combine crossover definition with example
18:         if O == P̃_em then
19:             ê_em ← ChatGPT(Concat(P̃_em, e_1, F_1))     // Generate mutated example
20:             Î ← Concat(d_1, ê_em)                      // Combine original definition with mutated example
21:         if O == P̃_ec then
22:             ê_ec ← ChatGPT(Concat(P̃_ec, e_1, F_1, e_2, F_2))  // Generate crossover example
23:             Î ← Concat(d_1, ê_ec)                      // Combine original definition with crossover example
24:         Q ← Q ∪ {Î}                                    // Add offspring to the population
25:     Q* ← CombinePopulations(P, Q)                      // Combine parent and offspring populations
26:     P ← SelectPopulation(Q*, M)                        // Select the best individuals for the next generation
27: P* ← P                                                 // Set the evolved population as the final population
28: return P*                                              // Return the evolved population

Note: The evaluation in EvaluateObjectives depends on the existence of a validation set; if there is no validation set, as for the Restaurant14 and Laptop14 datasets, it evaluates the test set.

Algorithm 1: The pseudo code of InstOptima.
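The control flow of Algorithm 1 can be sketched compactly in Python. All helper names below (initialize_population, evaluate_objectives, apply_operator, nsga2_select) are hypothetical stand-ins for the components described above, and apply_operator abstracts the ChatGPT call:

```python
import random

def evolve(initialize_population, evaluate_objectives, operators,
           apply_operator, nsga2_select, n_generations, pop_size):
    """A sketch of the InstOptima evolutionary loop (Algorithm 1)."""
    population = initialize_population(pop_size)
    for _ in range(n_generations):
        offspring = []
        for parent1 in population:
            parent2 = random.choice(population)          # random second parent
            f1 = evaluate_objectives(parent1)
            f2 = evaluate_objectives(parent2)
            op = random.choice(operators)                # random operator P̃
            child = apply_operator(op, parent1, f1, parent2, f2)  # LLM call
            offspring.append(child)
        # Combine parents and offspring, then keep the fittest M individuals
        population = nsga2_select(population + offspring, pop_size)
    return population
```

For instance, plugging in dummy callables (integer "instructions", a child defined as parent + 1, and truncation-by-sorting in place of NSGA-II selection) exercises the loop without any LLM access.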

A.5 Fixed Prompts for Instruction Operators

The prompts in green are the triggers of objective-guided instruction generation.

Table 4: The fixed prompts used to implement the LLM-based instruction operators. "<Input>" indicates the input of the operators. The green keywords (highlighted in the original rendering) are the triggers of objective-guided instruction generation.

Operator P̃_dm (Input: (d, F)):
"I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization, and I need your help to design and optimize the template prompt. Here I give you an example template prompt, please understand the meaning of the prompt and modify it. Given the minimization objectives, please be creative and output the paraphrased or mutated prompt. Please remove Minimization objectives in the output: <Input>"

Operator P̃_dc (Input: (d_1, F_1, d_2, F_2)):
"I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization for sentiment analysis, and I need your help to design and optimize the template prompt. Here I give you two template prompts, please understand the meaning of the two prompts and crossover them into a new prompt. Given the minimization objectives, please be creative and output the generated new prompt based on the two examples. Please remove Minimization objectives in the output: <Input>"

Operator P̃_em (Input: (e, F)):
"I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization for sentiment analysis, and I need your help to design and optimize the template prompt. Here I give you two groups of examples for completing the prompt, please generate new examples to substitute the following examples and there are no more than two examples in the new prompt. Given the minimization objectives, please be creative and output the generated example in the same format. Please remove Minimization objectives in the output: <Input>"

Operator P̃_ec (Input: (e_1, F_1, e_2, F_2)):
"I want you to be a professional prompt engineer. Now I am working on the multi-objective evolutionary prompt optimization for sentiment analysis, and I need your help to design and optimize the template prompt. Here I give you two groups of examples for completing the prompt, please read the examples of the two groups of examples and crossover the examples into a new example group and there are no more than two examples in the new examples. Given the minimization objectives, please be creative and output the crossovered the examples. Please remove Minimization objectives in the output: <Input>"
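Conceptually, each operator call substitutes the parent instruction parts and their objective values for the "<Input>" placeholder in the fixed prompt, i.e., Concat(P̃, d, F). A minimal sketch of that assembly (the rendering of the objective values and the helper name are assumptions, not the exact implementation):

```python
def build_operator_request(fixed_prompt, parts):
    """Substitute instruction parts and objective values for the
    <Input> placeholder of a fixed operator prompt."""
    payload = "\n".join(str(p) for p in parts)
    return fixed_prompt.replace("<Input>", payload)

# Hypothetical inputs for the definition-mutation operator P̃_dm
prompt_dm = "... Please remove Minimization objectives in the output: <Input>"
definition = "Classify the sentiment of the given sentence."
objectives = "Minimization objectives: (performance=0.21, length=455, perplexity=1.12)"
request = build_operator_request(prompt_dm, [definition, objectives])
```

The resulting string is then sent to ChatGPT as a single-round request, matching the single API call per operator described in the limitations.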