Abstract
While Large Language Models (LLMs) have demonstrated remarkable performance in certain dimensions, their ability to express the implicit language cues that humans use for effective communication remains unclear. This paper presents ExpressivityArena, a Python library for measuring the implicit communication abilities of LLMs. We provide a comprehensive framework to evaluate the expressivity of arbitrary LLMs and explore its practical implications. To this end, we refine the definition and measurement of “expressivity” and use our framework in a set of small experiments. These experiments test LLMs on creative and logical tasks such as poetry, coding, and emotion-based responses. The responses are then evaluated by an automated grader, through ExpressivityArena, which we verify to be the most pragmatic for testing expressivity. Building on these experiments, we deepen our understanding of the expressivity of LLMs by assessing their ability to remain expressive in conversations. Our findings indicate that LLMs are capable of generating and understanding expressive content, albeit with some limitations. These insights will inform the future development and deployment of expressive LLMs. We provide the code for ExpressivityArena alongside our paper.
Key Features
Key Results
Components
The LLM class represents the functionality of a Language Model.
name: The name of the LLM for identification purposes.
get_response: A function representing the prompt -> response functionality of the LLM.
The ExpressivityPrompt class encapsulates a prompt to produce a response conveying a particular signal in a specific context.
instruction: The type of response to generate, such as "poem," "speech," or "Python program."
expressivity_signal: The signal to encode into the response, such as "sad," "secretive," or "well-educated."
The ExpressivityResult class represents the result of an ExpressivityArena experiment.
LLM: The LLM used for the experiment.
prompt: The prompt used for the experiment.
result: The response the LLM gave for the prompt.
grade: A boolean indicating whether the response effectively expressed the signal.
The Grader class is a base class for grading whether responses express a given signal or not. Various grader schemas are offered.
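The components described above could be modeled roughly as follows. This is a minimal sketch, not the library's actual source: the field names follow the descriptions above (with the result's LLM field written as `llm`), while the `grade` method name and the `KeywordGrader` class are assumptions introduced only to make the example runnable. A real grader would judge implicit cues rather than look for the signal word itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LLM:
    name: str                           # label used to identify the model
    get_response: Callable[[str], str]  # prompt -> response


@dataclass
class ExpressivityPrompt:
    instruction: str          # e.g. "poem"
    expressivity_signal: str  # e.g. "sad"


@dataclass
class ExpressivityResult:
    llm: LLM
    prompt: ExpressivityPrompt
    result: str   # the response the LLM gave
    grade: bool   # True if the response expressed the signal


class Grader:
    """Base class; subclasses decide whether a response expresses a signal."""

    def grade(self, response: str, prompt: ExpressivityPrompt) -> bool:  # assumed method name
        raise NotImplementedError


class KeywordGrader(Grader):
    """Hypothetical toy grader, not one of the library's grader schemas."""

    def grade(self, response: str, prompt: ExpressivityPrompt) -> bool:
        return prompt.expressivity_signal.lower() in response.lower()


# Wire the pieces together with a stub model that needs no API access.
echo = LLM(name="echo", get_response=lambda prompt: "A sad little poem about rain.")
prompt = ExpressivityPrompt(instruction="poem", expressivity_signal="sad")
response = echo.get_response("Write a sad poem.")
result = ExpressivityResult(
    llm=echo,
    prompt=prompt,
    result=response,
    grade=KeywordGrader().grade(response, prompt),
)
print(result.grade)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client, which matches the prompt -> response description of get_response.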
Usage
Evaluation
The evaluate function evaluates an ExpressivityArena prompt to produce an expressive result and then grades the expressivity of that response.
evaluate(llm: LLM, grader: Grader, prompt: ExpressivityPrompt) -> ExpressivityResult
Batch Evaluation
The batch_evaluate function returns a list of ExpressivityArena experiment results for a list of prompts.
batch_evaluate(llm: LLM, grader: Grader, prompts: List[ExpressivityPrompt]) -> List[ExpressivityResult]
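To illustrate how the two functions relate, here is a hedged sketch in which batch evaluation is simply single evaluation applied to each prompt in turn. The prompt template, the `grade` method, and the `AlwaysExpressiveGrader` stand-in are all assumptions made for illustration; the library's own implementation may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LLM:
    name: str
    get_response: Callable[[str], str]


@dataclass
class ExpressivityPrompt:
    instruction: str
    expressivity_signal: str


@dataclass
class ExpressivityResult:
    llm: LLM
    prompt: ExpressivityPrompt
    result: str
    grade: bool


class AlwaysExpressiveGrader:
    """Hypothetical stand-in grader that accepts every response."""

    def grade(self, response: str, prompt: ExpressivityPrompt) -> bool:
        return True


def evaluate(llm: LLM, grader, prompt: ExpressivityPrompt) -> ExpressivityResult:
    # Assumed prompt template; the actual wording used by the library is not shown here.
    text = (
        f"Write a {prompt.instruction} that expresses "
        f"the following: {prompt.expressivity_signal}."
    )
    response = llm.get_response(text)
    return ExpressivityResult(llm, prompt, response, grader.grade(response, prompt))


def batch_evaluate(llm: LLM, grader, prompts: List[ExpressivityPrompt]) -> List[ExpressivityResult]:
    # One result per prompt, in the same order as the input list.
    return [evaluate(llm, grader, p) for p in prompts]


# A stub model that echoes its prompt, so the sketch runs offline.
stub = LLM(name="stub", get_response=lambda text: f"[response to: {text}]")
prompts = [
    ExpressivityPrompt("poem", "sad"),
    ExpressivityPrompt("speech", "secretive"),
]
results = batch_evaluate(stub, AlwaysExpressiveGrader(), prompts)
for r in results:
    print(r.prompt.expressivity_signal, r.grade)
```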
Example
from arena import evaluate_category
from context import SignalCategory
from grader import MultipleChoiceGrader
from llm import LLM
# Initialize LLM and Grader
def gpt_response_function(prompt: str):
    # Fetch a response for the given prompt from the GPT API...
    return response
llm = LLM(name="GPT", get_response=gpt_response_function)
grader = MultipleChoiceGrader(llm)
# Define signals
genres_category = SignalCategory("genres", [
    "horror", "romance", "thriller", "comedy", "drama"
])
# Evaluate prompts
results = evaluate_category(llm, grader, "short story", genres_category)
# Process results
for result in results:
    print(f"Prompt: {result.prompt}")
    print(f"Result: {result.result}")
    print(f"Grade: {'Expressive' if result.grade else 'Not Expressive'}")
    print()
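Beyond printing each result, it can be useful to summarize how often each signal was expressed successfully. The sketch below is one possible way to do that; the `expressivity_rate_by_signal` helper is hypothetical and not part of ExpressivityArena, and the stand-in classes only mirror the prompt, result, and grade fields described under Components.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ExpressivityPrompt:
    instruction: str
    expressivity_signal: str


@dataclass
class ExpressivityResult:
    prompt: ExpressivityPrompt
    result: str
    grade: bool


def expressivity_rate_by_signal(results: Iterable[ExpressivityResult]) -> Dict[str, float]:
    """Return the fraction of results graded expressive, keyed by signal."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in results:
        signal = r.prompt.expressivity_signal
        totals[signal] += 1
        hits[signal] += int(r.grade)
    return {signal: hits[signal] / totals[signal] for signal in totals}


# Hand-written sample results, for illustration only (not experimental data).
sample = [
    ExpressivityResult(ExpressivityPrompt("short story", "horror"), "...", True),
    ExpressivityResult(ExpressivityPrompt("short story", "horror"), "...", False),
    ExpressivityResult(ExpressivityPrompt("short story", "comedy"), "...", True),
]
rates = expressivity_rate_by_signal(sample)
print(rates)  # {'horror': 0.5, 'comedy': 1.0}
```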
Reference
Joshua Tint, Som Sagar, Aditya Taparia, Kelly Raines, Bimsara Pathiraja, Caleb Liu, Ransalu Senanayake. "ExpressivityArena: Can LLMs Express Information Implicitly?" In Proceedings of NAACL 2024.
BibTeX Citation:
@misc{tint2024expressivityarenallmsexpressinformation,
  title={ExpressivityArena: Can LLMs Express Information Implicitly?},
  author={Joshua Tint and Som Sagar and Aditya Taparia and Kelly Raines and Bimsara Pathiraja and Caleb Liu and Ransalu Senanayake},
  year={2024},
  eprint={2411.08010},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.08010},
}