01 — Overview

Abstract

Proceedings of SilkRoadNLP 2026, EACL · Rabat, Morocco · March 29, 2026 · pp. 105–113

Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet it remains underexplored for Persian. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources.

We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from over-correction tendencies and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% while maintaining efficiency suitable for real-time applications.
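
To make the token-level formulation concrete, the sketch below turns a punctuated sentence into (token, label) pairs, where each label names the mark that follows the token. The tag names (`O`, `COMMA`, and so on) and the helper `to_token_labels` are illustrative assumptions; the source names the four restored marks but not the tagging scheme.

```python
# Hypothetical tag names; the source does not specify the label scheme.
# \u060C is the Persian comma, \u061F is the Persian question mark.
PUNCT_LABELS = {
    ".": "PERIOD",
    ":": "COLON",
    "\u060C": "COMMA",
    "\u061F": "QUESTION",
}


def to_token_labels(text):
    """Split punctuated text into tokens and per-token punctuation labels.

    Each token is labelled with the mark that directly follows it,
    or "O" when no mark follows.
    """
    tokens, labels = [], []
    for raw in text.split():
        mark = raw[-1] if raw[-1] in PUNCT_LABELS else None
        word = raw[:-1] if mark else raw
        if not word:
            # A free-standing mark is attached to the previous token.
            if labels:
                labels[-1] = PUNCT_LABELS[mark]
            continue
        tokens.append(word)
        labels.append(PUNCT_LABELS[mark] if mark else "O")
    return tokens, labels


tokens, labels = to_token_labels("nah\u060C baba rast migi.")
print(list(zip(tokens, labels)))
```

Under this framing, a model such as ParsBERT would be trained to predict the label sequence from the unpunctuated tokens alone.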

02 — Motivation

The Power of a Single Comma

Persian punctuation placement can completely reverse a sentence's meaning — a critical challenge for ASR pipelines.

Without punctuation: bakhshesh lazem nist e'damesh konid
"No mercy needed, execute him" (Negative)
With punctuation (comma added): bakhshesh, lazem nist e'damesh konid
"Forgiveness, no need to execute him" (Positive)

Without punctuation: nah baba rast migi
"No way, are you serious?" (Sarcastic)
With punctuation (comma and period added): nah, baba rast migi.
"No, dad, you're right." (Affirmative)

03 — Contributions

Key Contributions

Three primary contributions to Persian NLP and low-resource language processing.

PersianPunc Dataset

A curated corpus of 17 million filtered and deduplicated samples spanning formal and informal Persian text across six complementary corpora and diverse domains.

Fine-tuned ParsBERT Model

Token-level sequence labeling achieving 91.33% macro F1 and 97.28% micro F1 — restoring 4 distinct Persian punctuation marks with real-time efficiency.

Rigorous LLM Comparison

Systematic evaluation showing that our lightweight model surpasses GPT-4o and GPT-4o-mini on macro F1, full sentence match, and computational efficiency.

04 — Performance

Results & Benchmarks

Evaluated on 1,000 held-out Persian sentences from PersianPunc.

Macro F1 (%): 91.33
Micro F1 (%): 97.28
Full Sent. Match (%): 61.80
Training Samples: 17M

Persian punctuation restoration model comparison

Model | Test Set | Macro F1 (%) | Full Sent. Match (%)
CRF (Hosseini & Sameti, 2017) | Hosseini et al. corpus † | 69.00 | -
ViraPart (Farokhshad et al., 2021) | Bijankhan corpus † | 92.13 | -
GPT-4o-mini | PersianPunc test set | 79.54 | 38.01
GPT-4o (OpenAI, 2023) | PersianPunc test set | 85.96 | 50.10
PersianPunc-ParsBERT (Ours) | PersianPunc test set | 91.33 | 61.80

† CRF and ViraPart results are on different test sets and cannot be directly compared to our results; shown for contextual reference only.
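
The metrics reported here can be illustrated with a small sketch that computes per-class F1, macro F1, micro F1, and full sentence match from gold and predicted label sequences. This is not the authors' evaluation script: the function names, the tag names, and the choice to average over every label that occurs (including the no-punctuation tag `O`) are assumptions made for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(gold_sents, pred_sents):
    """Score aligned label sequences (one list of labels per sentence).

    Assumes a non-empty test set and equal-length gold/predicted sequences.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    exact = 0
    for gold, pred in zip(gold_sents, pred_sents):
        # Full sentence match: every label in the sentence is correct.
        exact += gold == pred
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1  # predicted class gets a false positive
                fn[g] += 1  # gold class gets a false negative
    classes = sorted(set(tp) | set(fp) | set(fn))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    return {
        "per_class": per_class,
        # Macro F1: unweighted mean of per-class F1 scores.
        "macro_f1": sum(per_class.values()) / len(per_class),
        # Micro F1: F1 over counts pooled across all classes.
        "micro_f1": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "full_sentence_match": exact / len(gold_sents),
    }


gold = [["O", "COMMA", "O", "PERIOD"], ["O", "O", "QUESTION"]]
pred = [["O", "COMMA", "O", "PERIOD"], ["O", "O", "PERIOD"]]
print(evaluate(gold, pred))
```

Because macro F1 weights every class equally, rare marks pull it down more than they affect micro F1, which may be one reason the two reported scores differ. Full sentence match is stricter still, since a single wrong label fails the whole sentence.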

05 — Breakdown

Per-Punctuation F1 Scores

Performance across each of the four restored punctuation marks on the 1,000-sentence test set.

Period (.): 98.71
Colon (:): 90.45
Question (؟): 88.89
Persian Comma (،): 80.03
Macro Average: 91.33

06 — Data

Dataset Composition

17,102,014 unique samples after SHA-256 hash-based exact deduplication across six source corpora.

Formal Academic Text

Bijankhan-Peykare Corpus, Persian Medical QA, and Persian Wikipedia — standardised punctuation in formal literary, medical, and encyclopedic contexts.

Contemporary Informal Text

Persian Telegram Channels, Farsi Stories, and Blog Dataset V2 — modern conversational patterns covering social media, narrative fiction, and personal blogging.

Quality Pipeline

Multi-stage filtering: structural, content, and linguistic checks. Exact deduplication via SHA-256 with whitespace normalisation across all sources.
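
The exact-deduplication step might look like the sketch below, which hashes each whitespace-normalised sample with SHA-256 and keeps only the first occurrence. The function names and the specific normalisation rule (collapsing all whitespace runs to a single space and trimming the ends) are assumptions; the source states only that whitespace is normalised before hashing.

```python
import hashlib


def normalize_whitespace(text):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return " ".join(text.split())


def deduplicate(samples):
    """Keep the first occurrence of each sample, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for sample in samples:
        normalized = normalize_whitespace(sample)
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


print(deduplicate(["a  b", "a b", " a b ", "a c"]))
```

Storing fixed-size digests rather than full strings keeps the memory footprint of the seen-set predictable, which matters at the scale of tens of millions of samples.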

Punctuation Distribution

Persian comma (50.1%), Period (35.5%), Colon (10.0%), Exclamation (2.9%), Question (1.6%). Average 2.51 punctuation marks per sentence.
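
Statistics of this kind can be computed with a short counting pass. The sketch below is an illustration under assumptions: the function name is hypothetical, and it counts characters directly rather than reproducing whatever tokenisation the authors used.

```python
from collections import Counter

# The five marks listed in the distribution above.
# \u060C is the Persian comma, \u061F is the Persian question mark.
MARKS = {
    "\u060C": "Persian comma",
    ".": "Period",
    ":": "Colon",
    "!": "Exclamation",
    "\u061F": "Question",
}


def punctuation_stats(sentences):
    """Return each mark's share (%) of all marks and the mean marks per sentence."""
    counts = Counter(ch for s in sentences for ch in s if ch in MARKS)
    total = sum(counts.values())
    shares = {MARKS[m]: 100 * n / total for m, n in counts.items()} if total else {}
    average = total / len(sentences) if sentences else 0.0
    return shares, average


shares, average = punctuation_stats(["a\u060C b.", "c: d\u060C e."])
print(shares, average)
```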

07 — Open Resources

All Resources Are Public

Dataset, model, and paper freely available to support future research in Persian and low-resource NLP.

08 — Reference

How to Cite

Please use the official ACL Anthology citation.

BibTeX
@inproceedings{ranjbar-kalahroodi-etal-2026-persianpunc,
    title = {{P}ersian{P}unc: A Large-Scale Dataset and {BERT}-Based Approach for {P}ersian Punctuation Restoration},
    author = {Ranjbar Kalahroodi, Mohammad Javad and Faili, Heshaam and Shakery, Azadeh},
    editor = {Merchant, Rayyan and Megerdoomian, Karine},
    booktitle = {The Proceedings of the First Workshop on {NLP} and {LLM}s for the {I}ranian Language Family},
    month = mar,
    year = {2026},
    address = {Rabat, Morocco},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2026.silkroadnlp-1.11/},
    doi = {10.18653/v1/2026.silkroadnlp-1.11},
    pages = {105--113},
    ISBN = {979-8-89176-371-5},
}

DOI: 10.18653/v1/2026.silkroadnlp-1.11  ·  ISBN: 979-8-89176-371-5