These datasets currently include text only. Audio data will be published in a future release. Access to the datasets can be requested through the contact person.
Datasets for additional languages and dialects are coming soon, including Malay, Minangkabau, Banjarese, Batak, Buginese, Javanese, and Sundanese.
This dataset is distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Users are permitted to use, reproduce, and modify the dataset for non-commercial research and educational purposes, provided that appropriate credit is given to the original authors and source. Any derivative works or adaptations based on this dataset must be distributed under the same license (CC BY-NC-SA 4.0). Commercial use of the dataset, in whole or in part, including but not limited to incorporation into proprietary systems or services, is prohibited without prior written permission from the authors.
Authorship for each dataset reflects its primary contributor(s), with Yusra and Muhammad Fikry included as co-authors across all datasets. For example:
Fauzan, B., Yusra, & Fikry, M. (2026). Bahasa Minangkabau (Dialek Lima Puluh Kota) NLP dataset [Data set]. Bhinneka NLP-RG. https://nlp-rg.yusrafikry.com