PersianPunc Dataset
A curated corpus of 17 million filtered and deduplicated samples spanning formal and informal Persian text across six complementary corpora and diverse domains.
University of Tehran · Institute for Research in Fundamental Sciences (IPM)
SilkRoadNLP 2026 — First Workshop on NLP & LLMs for the Iranian Language Family · EACL 2026 · Rabat, Morocco
Proceedings of SilkRoadNLP 2026, EACL · Rabat, Morocco · March 29, 2026 · pp. 105–113
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet it remains underexplored for Persian. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources.
We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from over-correction tendencies and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% while maintaining efficiency suitable for real-time applications.
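To make the token-level formulation concrete, the following is a minimal sketch of how punctuated text could be turned into (word, label) training pairs and how predicted labels could be turned back into text. It is an illustration under stated assumptions, not the authors' preprocessing code: the function names `to_token_labels` and `restore_punctuation` are hypothetical, and the four marks in `MARK_TO_LABEL` are an assumed label set, since this page does not list which four marks the model restores. A real pipeline would also have to align these word-level labels with ParsBERT's subword tokens, which is not shown here.

```python
# Assumed label set for illustration only; the page does not name the four marks.
MARK_TO_LABEL = {"،": "COMMA", ".": "PERIOD", ":": "COLON", "؟": "QUESTION"}
LABEL_TO_MARK = {label: mark for mark, label in MARK_TO_LABEL.items()}


def to_token_labels(text):
    """Strip punctuation and label each word with the mark that followed it."""
    tokens, labels = [], []
    for raw in text.split():
        word, label = raw, "O"  # "O" means no punctuation after this word
        # Peel trailing marks; the mark closest to the word ends up as the label.
        while word and word[-1] in MARK_TO_LABEL:
            label = MARK_TO_LABEL[word[-1]]
            word = word[:-1]
        if word:
            tokens.append(word)
            labels.append(label)
        elif tokens and labels[-1] == "O":
            # A free-standing mark is attached to the preceding word.
            labels[-1] = label
    return tokens, labels


def restore_punctuation(tokens, labels):
    """Rebuild text by appending each word's predicted mark, if any."""
    return " ".join(tok + LABEL_TO_MARK.get(lab, "") for tok, lab in zip(tokens, labels))


tokens, labels = to_token_labels("سلام، حال شما چطور است؟")
print(list(zip(tokens, labels)))
print(restore_punctuation(tokens, labels))
```

Framing the task this way means the model only decides, for each word, which mark (if any) follows it. It cannot rewrite the words themselves, which is one plausible reason a tagger avoids the over-correction reported for generative models.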
Persian punctuation placement can completely reverse a sentence's meaning — a critical challenge for ASR pipelines.
Three primary contributions to Persian NLP and low-resource language processing.
A curated corpus of 17 million filtered and deduplicated samples spanning formal and informal Persian text across six complementary corpora and diverse domains.
Token-level sequence labeling with a fine-tuned ParsBERT model, achieving 91.33% macro F1 and 97.28% micro F1 while restoring four distinct Persian punctuation marks with real-time efficiency.
Systematic evaluation showing that our lightweight model surpasses GPT-4o and GPT-4o-mini on macro F1 (91.33% vs. 85.96% and 79.54%), on full sentence match (61.80% vs. 50.10% and 38.01%), and on computational efficiency.
Evaluated on 1,000 held-out Persian sentences from PersianPunc.
| Model | Test Set | Macro F1 (%) | Full Sentence Match (%) |
|---|---|---|---|
| CRF (Hosseini & Sameti, 2017) | Hosseini et al. corpus † | 69.00 | — |
| ViraPart (Farokhshad et al., 2021) | Bijankhan corpus † | 92.13 | — |
| GPT-4o-mini | PersianPunc test set | 79.54 | 38.01 |
| GPT-4o (OpenAI, 2023) | PersianPunc test set | 85.96 | 50.10 |
| PersianPunc-ParsBERT (Ours) | PersianPunc test set | 91.33 | 61.80 |
† CRF and ViraPart results are on different test sets and cannot be directly compared to our results; shown for contextual reference only.
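For readers unfamiliar with the metrics in the table, here is a small sketch of how macro F1, micro F1, and full sentence match can be computed from gold and predicted label sequences. The function names are hypothetical and the toy labels are invented for illustration. Whether the reported scores include the no-punctuation class is not stated on this page, so the `ignore` parameter is an assumption: pass `ignore=None` to count every class.

```python
from collections import Counter


def f1_scores(gold, pred, ignore="O"):
    """Return (per-class F1, macro F1, micro F1) over flat label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1
        else:
            if p != ignore:
                fp[p] += 1  # predicted a mark that is not in the gold text
            if g != ignore:
                fn[g] += 1  # missed (or mislabelled) a gold mark

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    classes = sorted(set(tp) | set(fp) | set(fn))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    # Macro: unweighted mean of per-class F1, so rare marks count equally.
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    # Micro: F1 over pooled counts, so frequent marks dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


def full_sentence_match(gold_sents, pred_sents):
    """Fraction of sentences whose entire label sequence is predicted exactly."""
    matches = sum(g == p for g, p in zip(gold_sents, pred_sents))
    return matches / len(gold_sents)


# Toy data, invented for illustration.
gold_sents = [["COMMA", "O", "PERIOD"], ["O", "QUESTION"]]
pred_sents = [["COMMA", "O", "PERIOD"], ["COMMA", "PERIOD"]]
gold = [lab for sent in gold_sents for lab in sent]
pred = [lab for sent in pred_sents for lab in sent]

per_class, macro, micro = f1_scores(gold, pred)
print(per_class, macro, micro, full_sentence_match(gold_sents, pred_sents))
```

Full sentence match is much stricter than token-level F1, because a single wrong label anywhere in a sentence makes the whole sentence count as a miss. That explains how a model can reach a macro F1 above 90% while matching only about 62% of sentences exactly.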
Performance across each of the four restored punctuation marks on the 1,000-sentence test set.
17,102,014 unique samples after SHA-256 hash-based exact deduplication across six source corpora.
Bijankhan-Peykare Corpus, Persian Medical QA, and Persian Wikipedia — standardised punctuation in formal literary, medical, and encyclopedic contexts.
Persian Telegram Channels, Farsi Stories, and Blog Dataset V2 — modern conversational patterns covering social media, narrative fiction, and personal blogging.
Multi-stage filtering: structural, content, and linguistic checks. Exact deduplication via SHA-256 with whitespace normalisation across all sources.
Punctuation distribution: Persian comma (50.1%), period (35.5%), colon (10.0%), exclamation mark (2.9%), question mark (1.6%). Sentences contain an average of 2.51 punctuation marks.
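The exact-deduplication step described above can be sketched as follows. This is a minimal illustration, assuming that "whitespace normalisation" means trimming each sample and collapsing runs of whitespace to a single space. The function names are hypothetical and the actual pipeline may normalise differently.

```python
import hashlib


def normalise_whitespace(text):
    """Trim the text and collapse any run of whitespace to a single space."""
    return " ".join(text.split())


def deduplicate(samples):
    """Keep the first occurrence of each sample, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for sample in samples:
        normalised = normalise_whitespace(sample)
        # Storing 32-byte digests instead of full strings keeps memory bounded
        # when the same check is run across several large corpora.
        digest = hashlib.sha256(normalised.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(normalised)
    return unique


samples = ["سلام  دنیا.", "سلام دنیا.", " سلام دنیا. ", "خداحافظ."]
print(deduplicate(samples))
```

Hash-based matching removes only exact duplicates after normalisation. Near-duplicates that differ by a character or a punctuation mark are kept as distinct samples.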
Dataset, model, and paper freely available to support future research in Persian and low-resource NLP.
Full paper with methodology, experiments, appendices, and zero-shot LLM evaluation prompts. DOI: 10.18653/v1/2026.silkroadnlp-1.11
ACL Anthology
The curated 1M-sample subset used for model training and benchmarking, with train / validation / test splits.
HuggingFace Datasets
The complete large-scale corpus of 17,102,014 unique Persian sentences spanning formal and informal text across all six source domains.
HuggingFace Datasets
Fine-tuned ParsBERT for Persian punctuation restoration: lightweight, fast, and ready for real-time ASR pipelines.
Model upload coming soon
Please use the official ACL Anthology citation.
DOI: 10.18653/v1/2026.silkroadnlp-1.11 · ISBN: 979-8-89176-371-5