Building large-scale audio-language datasets is crucial for training audio-language models, yet the process remains challenging because it is time-consuming and labour-intensive.
Although large language models (LLMs) have greatly improved the efficiency of this process, current LLM-based pipelines for generating audio-text data still lack the ability to incorporate detailed audio information.
In this paper, we propose a novel pipeline leveraging large audio-language models to generate large-scale, fine-grained audio captions automatically.
Based on this approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs derived from recordings in AudioSet.
We evaluate AudioSetCaps on two downstream tasks: audio-text retrieval and automated audio captioning.
Models trained with AudioSetCaps achieve state-of-the-art performance on both tasks, demonstrating the high quality of the generated captions.
Notably, our proposed data-labelling pipeline employs open-source APIs and can run on a consumer-grade GPU.
To facilitate further advancements in this field, we have publicly released our code, the audio-caption paired data, and the models pre-trained on the downstream tasks.