🚀 Unleashing Ultra-Long Text Generation with LongWriter

Hassan Dhia
August 20, 2024

Large language models (LLMs) have made remarkable progress, but output length has remained a stubborn ceiling: most models cap out around 2,000 words. The team behind LongWriter set out to break that barrier and enable these models to generate coherent texts exceeding 10,000 words.

🧠 The Core Problem

The researchers identified a key limitation: the scarcity of long-output examples in existing supervised fine-tuning (SFT) datasets. A model that only sees short outputs during fine-tuning learns to stop early, so its output length is effectively bounded by the longest examples in its training data. This insight led them to tackle the problem at the data level.

⚙️ The AgentWrite Pipeline

The heart of their solution is the AgentWrite pipeline, which breaks down the daunting task of generating ultra-long texts into manageable subtasks. Here's how it works:

Planning Stage: The model first creates a detailed writing plan. This plan outlines the structure and target word count for each paragraph based on the user's input.
Writing Stage: Following the plan, the model generates content for each paragraph sequentially.

By decomposing the task, the model can handle each part more effectively, leading to coherent and extended outputs.
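
The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, and the `Paragraph N - main point - K words` plan format are all hypothetical choices made for this example.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]

# Assumed plan line format: "Paragraph 1 - main point - 300 words"
PLAN_LINE = re.compile(
    r"^\s*Paragraph\s+\d+\s*-\s*(?P<point>.+?)\s*-\s*(?P<words>\d+)\s*words?\s*$",
    re.IGNORECASE,
)


def make_plan(llm: LLM, instruction: str) -> List[Tuple[str, int]]:
    """Planning stage: ask for an outline with a target word count per paragraph."""
    prompt = (
        "Break the writing task below into paragraphs.\n"
        "Answer with one line per paragraph, formatted as\n"
        "'Paragraph N - main point - K words'.\n\n"
        f"Task: {instruction}"
    )
    plan = []
    for line in llm(prompt).splitlines():
        match = PLAN_LINE.match(line)
        if match:  # skip any line that does not follow the assumed format
            plan.append((match.group("point"), int(match.group("words"))))
    return plan


def write_from_plan(llm: LLM, instruction: str, plan: List[Tuple[str, int]]) -> str:
    """Writing stage: generate paragraphs one by one, showing the text so far."""
    outline = "\n".join(
        f"Paragraph {i} - {point} - {words} words"
        for i, (point, words) in enumerate(plan, start=1)
    )
    written: List[str] = []
    for i, (point, words) in enumerate(plan, start=1):
        so_far = "\n\n".join(written)
        prompt = (
            f"Task: {instruction}\n\n"
            f"Plan:\n{outline}\n\n"
            f"Text written so far:\n{so_far}\n\n"
            f"Now write paragraph {i} ({point}) in about {words} words."
        )
        written.append(llm(prompt).strip())
    return "\n\n".join(written)


def agent_write(llm: LLM, instruction: str) -> str:
    """Run the planning stage, then the writing stage."""
    return write_from_plan(llm, instruction, make_plan(llm, instruction))
```

Passing the text written so far into each prompt is one simple way to keep later paragraphs consistent with earlier ones. Any chat or completion client can be wrapped as the `llm` argument.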

🗂 Building the LongWriter-6k Dataset

To support this new approach, the team constructed the LongWriter-6k dataset. It contains 6,000 SFT examples with output lengths ranging from 2,000 to 32,000 words. Training on this dataset let them scale the output length of existing models to over 10,000 words.
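
To make the length range concrete, here is a small sketch of how SFT examples could be kept or dropped based on output length. It is only an illustration: the `prompt` and `output` field names and the whitespace-based word count are assumptions, and the post does not describe how the actual dataset was filtered.

```python
from typing import Dict, Iterable, List


def word_count(text: str) -> int:
    """Approximate word count by splitting on whitespace (an assumption)."""
    return len(text.split())


def filter_by_output_length(
    examples: Iterable[Dict[str, str]],
    min_words: int = 2_000,
    max_words: int = 32_000,
) -> List[Dict[str, str]]:
    """Keep examples whose 'output' field falls within the word range, inclusive."""
    return [
        ex for ex in examples
        if min_words <= word_count(ex["output"]) <= max_words
    ]
```

The default bounds mirror the 2,000 to 32,000 word range quoted above. Whitespace splitting works reasonably for English, while other languages would need a different counting rule.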

📝 Evaluating Performance with LongBench-Write

To rigorously assess their approach, the researchers developed LongBench-Write, a comprehensive benchmark tailored for evaluating ultra-long text generation capabilities. This benchmark allowed them to measure the effectiveness of their models in generating extended outputs.

🎯 Experimental Results

The experiments were conducted using both proprietary and open-source models. The results were impressive:

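Models trained with the LongWriter-6k dataset could generate coherent outputs exceeding 10,000 words, while the AgentWrite pipeline itself enabled off-the-shelf models to produce outputs exceeding 20,000 words.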
The LongWriter-9B model achieved state-of-the-art performance on LongBench-Write, surpassing even larger proprietary models.

🎉 Advantages and Limitations

Advantages:

Extended Output Length: The approach successfully scales the output window size of LLMs to over 10,000 words.
Maintained Quality: Despite the increased length, the output quality remains high.
Innovative Pipeline: The AgentWrite pipeline introduces a novel method for managing long writing tasks.

Limitations:

Data Dependency: The approach relies heavily on the availability and quality of long-output SFT data.
Inference Efficiency: Generating ultra-long outputs can be computationally intensive and may impact inference efficiency.

🔍 Conclusion

The LongWriter project represents a significant advance in extending the output length of long-context LLMs. By introducing the AgentWrite pipeline and constructing the LongWriter-6k dataset, the researchers have shown that, with appropriate data, LLMs can generate coherent outputs exceeding 10,000 words. Challenges remain, notably data dependency and inference efficiency, but this work opens new possibilities for ultra-long text generation. Future efforts should focus on further extending output lengths, improving data quality, and enhancing inference efficiency.
