ExpressivityArena: Can LLMs Express Information Implicitly?

Abstract

While Large Language Models (LLMs) have demonstrated remarkable performance in certain dimensions, their ability to express the implicit language cues that humans use for effective communication remains unclear. This paper presents ExpressivityArena, a Python library for measuring the implicit communication abilities of LLMs. We provide a comprehensive framework to evaluate the expressivity of arbitrary LLMs and explore its practical implications. To this end, we refine the definition and measurements of “expressivity,” and use our framework in a set of small experiments. These experiments test LLMs in creative and logical tasks such as poetry, coding, and emotion-based responses. The responses are then evaluated by an automated grader, through ExpressivityArena, which we verify to be the most pragmatic for testing expressivity. Building on these experiments, we deepen our understanding of the expressivity of LLMs by assessing their ability to remain expressive in conversations. Our findings indicate that LLMs are capable of generating and understanding expressive content, though with some limitations. These insights will inform the future development and deployment of expressive LLMs. We provide the code for ExpressivityArena alongside our paper.

Key Features

Expressivity Evaluation Framework: A Python library for testing the ability of Large Language Models (LLMs) to convey implicit information in text.

Automated Grader: Utilizes an automated grading system for efficiently testing LLM responses across various tasks like poetry and code generation.

Diverse Experiment Domains: Evaluates LLMs on both high-expressivity tasks (e.g., poetry) and low-expressivity tasks (e.g., code generation).

Key Results

LLM Performance in Poetry vs. Code: LLMs performed significantly better in expressive tasks like poetry than in more functional tasks like code generation, suggesting limitations in conveying expressive intent in programming.

Conversational Expressivity: LLMs could maintain expressive cues over the course of simulated conversations. Expressivity increased consistently for profession-based signals but decreased over time for emotion-based signals.

Automated Grader Accuracy: The automated grader performed comparably to human evaluators, validating its use in assessing LLM-generated text expressivity.

Components

LLM (Language Model):

The LLM class represents the functionality of a Language Model.

name: The name of the LLM for identification purposes.
get_response: A function representing the prompt -> response functionality of the LLM.
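
As a minimal sketch of how such a wrapper could look (not the library's actual source), the following assumes the class is a plain dataclass holding the two fields listed above. The `echo_response` stub is hypothetical and stands in for a real model API call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LLM:
    """Illustrative stand-in for the library's LLM class (assumed layout)."""
    name: str                           # label used to identify the model
    get_response: Callable[[str], str]  # maps a prompt string to a response string


def echo_response(prompt: str) -> str:
    # Hypothetical offline stub; a real function would call a model API here.
    return f"[stub reply to: {prompt}]"


stub_llm = LLM(name="stub", get_response=echo_response)
reply = stub_llm.get_response("Write a sad poem.")
print(reply)
```

Because the model is just a named callable, any backend that maps a string to a string can presumably be plugged in the same way.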

ExpressivityPrompt:

The ExpressivityPrompt class encapsulates a prompt to produce a response conveying a particular signal in a specific context.

instruction: The type of response to generate, such as "poem," "speech," or "Python program."
expressivity_signal: The signal to encode into the response, such as "sad," "secretive," or "well-educated."
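
The two fields above could be modeled as follows. This is a sketch under assumptions: the field names come from this README, but the `to_text` helper and its wording are hypothetical, since the real prompt template is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressivityPrompt:
    """Illustrative stand-in; field names follow the description above."""
    instruction: str          # e.g. "poem", "speech", "Python program"
    expressivity_signal: str  # e.g. "sad", "secretive", "well-educated"

    def to_text(self) -> str:
        # Hypothetical helper: this phrasing is an assumption, not the
        # template the library actually sends to the model.
        return (
            f"Write a {self.instruction} that implicitly conveys the quality "
            f"'{self.expressivity_signal}' without stating it outright."
        )


prompt = ExpressivityPrompt(instruction="poem", expressivity_signal="sad")
prompt_text = prompt.to_text()
print(prompt_text)
```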

ExpressivityResult:

The ExpressivityResult class represents the result of an ExpressivityArena experiment.

LLM: The LLM used for the experiment.
prompt: The prompt used for the experiment.
result: The response the LLM gave for the prompt.
grade: A boolean indicating whether the response effectively expressed the signal.

Grader:

The Grader class is a base class for grading whether responses express a given signal or not. Various grader schemas are offered.
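
To illustrate the base-class idea, here is a hedged sketch with one toy subclass. The method name `grade` and the `KeywordGrader` class are assumptions made for this example; they are not the library's graders (the usage example in this README uses `MultipleChoiceGrader`, whose internals are not shown).

```python
from abc import ABC, abstractmethod
from typing import Dict, Set


class Grader(ABC):
    """Illustrative base class; the method name `grade` is an assumption."""

    @abstractmethod
    def grade(self, response: str, signal: str) -> bool:
        """Return True if `response` is judged to express `signal`."""


class KeywordGrader(Grader):
    """Toy grader (hypothetical): looks for cue words tied to a signal."""

    def __init__(self, cues: Dict[str, Set[str]]) -> None:
        self.cues = cues

    def grade(self, response: str, signal: str) -> bool:
        # Split on whitespace and test for any overlap with the cue set.
        words = set(response.lower().split())
        return bool(words & self.cues.get(signal, set()))


keyword_grader = KeywordGrader({"sad": {"tears", "alone", "grief"}})
hit = keyword_grader.grade("She sat alone by the window", "sad")
miss = keyword_grader.grade("The sun rose over the hills", "sad")
print(hit, miss)
```

A keyword check is far too crude for real implicit cues; it only shows where a grading schema plugs in.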

Usage

Evaluation

The evaluate function evaluates an ExpressivityArena prompt to produce an expressive result and then grades the expressivity of that response.

evaluate(llm: LLM, grader: Grader, prompt: ExpressivityPrompt) -> ExpressivityResult

Batch Evaluation

The batch_evaluate function returns a list of ExpressivityArena experiment results for a list of prompts.

batch_evaluate(llm: LLM, grader: Grader, prompts: List[ExpressivityPrompt]) -> List[ExpressivityResult]
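
The flow behind `evaluate` and `batch_evaluate` can be sketched as below. Everything here is a simplified stand-in written for this example: the class layouts, the `grade` method name, the prompt wording, and `ContainsSignalGrader` are assumptions, not the library's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LLM:
    name: str
    get_response: Callable[[str], str]


@dataclass
class ExpressivityPrompt:
    instruction: str
    expressivity_signal: str


@dataclass
class ExpressivityResult:
    llm: LLM
    prompt: ExpressivityPrompt
    result: str
    grade: bool


class Grader:
    def grade(self, response: str, signal: str) -> bool:  # assumed method name
        raise NotImplementedError


class ContainsSignalGrader(Grader):
    # Demo-only grader: a response that names its signal is not implicit,
    # so this merely keeps the sketch runnable offline.
    def grade(self, response: str, signal: str) -> bool:
        return signal in response


def evaluate(llm: LLM, grader: Grader, prompt: ExpressivityPrompt) -> ExpressivityResult:
    # Step 1: turn the prompt into text (wording is an assumption).
    text = f"Write a {prompt.instruction} that conveys: {prompt.expressivity_signal}"
    # Step 2: query the model. Step 3: grade the response against the signal.
    response = llm.get_response(text)
    grade = grader.grade(response, prompt.expressivity_signal)
    return ExpressivityResult(llm=llm, prompt=prompt, result=response, grade=grade)


def batch_evaluate(
    llm: LLM, grader: Grader, prompts: List[ExpressivityPrompt]
) -> List[ExpressivityResult]:
    # One result per prompt, in input order.
    return [evaluate(llm, grader, p) for p in prompts]


stub_llm = LLM("stub", lambda _text: "a quiet, sad little poem")
prompts = [ExpressivityPrompt("poem", "sad"), ExpressivityPrompt("poem", "joyful")]
results = batch_evaluate(stub_llm, ContainsSignalGrader(), prompts)
grades = [r.grade for r in results]
print(grades)
```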

Example

from arena import evaluate_category
from context import SignalCategory
from grader import MultipleChoiceGrader
from llm import LLM

# Initialize LLM and Grader
def gpt_response_function(prompt: str) -> str:
    # Fetch a response for the given prompt from the GPT API...
    response = ...  # placeholder: assign the text returned by the API call
    return response

llm = LLM(name="GPT", get_response=gpt_response_function)
grader = MultipleChoiceGrader(llm)

# Define signals
genres_category = SignalCategory("genres", [
    "horror", "romance", "thriller", "comedy", "drama"
])

# Evaluate prompts
results = evaluate_category(llm, grader, "short story", genres_category)

# Process results
for result in results:
    print(f"Prompt: {result.prompt}")
    print(f"Result: {result.result}")
    print(f"Grade: {'Expressive' if result.grade else 'Not Expressive'}")
    print()
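
Beyond printing each result, it can be useful to summarize grades per signal. The helper below is a hypothetical addition, not part of the library. It assumes each result exposes `grade` and a `prompt` with an `expressivity_signal` attribute, and uses `SimpleNamespace` objects as stand-ins for real results.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable


def expressive_rate_by_signal(results: Iterable) -> Dict[str, float]:
    """Fraction of responses graded expressive, grouped by signal."""
    counts = defaultdict(lambda: [0, 0])  # signal -> [expressive, total]
    for r in results:
        signal = r.prompt.expressivity_signal
        counts[signal][0] += int(bool(r.grade))
        counts[signal][1] += 1
    return {s: hits / total for s, (hits, total) in counts.items()}


def fake_result(signal: str, grade: bool) -> SimpleNamespace:
    # Stand-in for an ExpressivityResult, for demonstration only.
    return SimpleNamespace(
        prompt=SimpleNamespace(expressivity_signal=signal), grade=grade
    )


demo_results = [
    fake_result("horror", True),
    fake_result("horror", False),
    fake_result("comedy", True),
]
rates = expressive_rate_by_signal(demo_results)
print(rates)
```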

Reference

Joshua Tint, Som Sagar, Aditya Taparia, Kelly Raines, Bimsara Pathiraja, Caleb Liu, Ransalu Senanayake. "ExpressivityArena: Can LLMs Express Information Implicitly?" In Proceedings of NAACL 2024.

BibTeX Citation:

@misc{tint2024expressivityarenallmsexpressinformation,
      title={ExpressivityArena: Can LLMs Express Information Implicitly?}, 
      author={Joshua Tint and Som Sagar and Aditya Taparia and Kelly Raines and Bimsara Pathiraja and Caleb Liu and Ransalu Senanayake},
      year={2024},
      eprint={2411.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08010}, 
}