Usage Guide
This guide covers how to use phrasplit’s Python API for text splitting.
Splitting Sentences
The split_sentences() function uses spaCy’s NLP pipeline to
intelligently detect sentence boundaries:
from phrasplit import split_sentences
text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
sentences = split_sentences(text)
print(sentences)
# ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.']
The function correctly handles:
Abbreviations: Mr., Mrs., Dr., Prof., etc.
Acronyms: U.S.A., U.K., etc.
Titles: Ph.D., M.D., etc.
URLs: www.example.com
Ellipses: Text with… ellipses
Example with abbreviations:
text = "Mr. Brown met Prof. Green. They discussed the U.S.A. case."
sentences = split_sentences(text)
# ['Mr. Brown met Prof. Green.', 'They discussed the U.S.A. case.']
Colon Splitting
By default, colons are treated as sentence terminators. This is useful for news-style text:
text = "Breaking News: The event has started."
sentences = split_sentences(text)
# ['Breaking News:', 'The event has started.']
# Disable colon splitting if needed
sentences = split_sentences(text, split_on_colon=False)
# ['Breaking News: The event has started.']
Splitting Clauses
The split_clauses() function splits text at commas, creating
natural pause points ideal for audiobook and text-to-speech applications:
from phrasplit import split_clauses
text = "I like coffee, and I like tea."
clauses = split_clauses(text)
print(clauses)
# ['I like coffee,', 'and I like tea.']
The comma is kept at the end of each clause, preserving the original punctuation.
More complex example:
text = "When the sun rose, the birds began to sing, and the day started."
clauses = split_clauses(text)
# ['When the sun rose,', 'the birds began to sing,', 'and the day started.']
Splitting Paragraphs
The split_paragraphs() function splits text at double newlines:
from phrasplit import split_paragraphs
text = """First paragraph with some text.
Second paragraph with more text.
Third paragraph."""
paragraphs = split_paragraphs(text)
# ['First paragraph with some text.',
# 'Second paragraph with more text.',
# 'Third paragraph.']
The function handles multiple blank lines and whitespace-only lines:
text = "First.\n\n\n\nSecond." # Multiple blank lines
paragraphs = split_paragraphs(text)
# ['First.', 'Second.']
Hierarchical Splitting with split_text
The split_text() function provides a unified interface for
splitting text while preserving paragraph and sentence structure. This is
particularly useful for audiobook generation where you need different pause
lengths between paragraphs, sentences, and clauses.
from phrasplit import split_text, Segment
text = "First sentence. Second sentence.\n\nNew paragraph here."
segments = split_text(text, mode="sentence")
for seg in segments:
print(f"Paragraph {seg.paragraph}, Sentence {seg.sentence}: {seg.text}")
# Paragraph 0, Sentence 0: First sentence.
# Paragraph 0, Sentence 1: Second sentence.
# Paragraph 1, Sentence 0: New paragraph here.
Available Modes
"paragraph": Returns paragraphs only (sentenceis None)"sentence": Returns sentences with paragraph tracking"clause": Returns clauses with paragraph and sentence tracking
# Paragraph mode
segments = split_text(text, mode="paragraph")
# Each segment has sentence=None
# Sentence mode (default)
segments = split_text(text, mode="sentence")
# Each segment has paragraph and sentence indices
# Clause mode - finest granularity
text = "Hello, world. Goodbye, friend."
segments = split_text(text, mode="clause")
# Returns: Hello, | world. | Goodbye, | friend.
# All with paragraph and sentence tracking
Detecting Structure Changes
Use the Segment fields to detect when paragraphs or sentences change:
from phrasplit import split_text
text = "Sent 1. Sent 2.\n\nSent 3."
segments = split_text(text, mode="sentence")
for i, seg in enumerate(segments):
if i > 0 and seg.paragraph != segments[i-1].paragraph:
print("--- New Paragraph ---")
print(seg.text)
Splitting Long Lines
The split_long_lines() function breaks long lines at natural
boundaries (sentences and clauses) to fit within a maximum length:
from phrasplit import split_long_lines
text = "This is a very long sentence. This is another sentence that makes it even longer."
lines = split_long_lines(text, max_length=40)
# Each line will be <= 40 characters when possible
The splitting strategy:
First, try to split at sentence boundaries
If still too long, split at clause boundaries (commas)
If still too long, split at word boundaries
Using Different Language Models
All functions that use spaCy accept a language_model parameter:
from phrasplit import split_sentences
# Use a larger, more accurate model
sentences = split_sentences(text, language_model="en_core_web_lg")
# Use a model for another language
sentences = split_sentences(german_text, language_model="de_core_news_sm")
Make sure to download the model first:
python -m spacy download de_core_news_sm
Processing Pipeline Example
Here’s a complete example of processing a document:
from phrasplit import split_paragraphs, split_sentences, split_clauses
def process_document(text):
"""Process a document into structured parts."""
result = []
for para_idx, paragraph in enumerate(split_paragraphs(text)):
para_data = {"paragraph": para_idx + 1, "sentences": []}
for sent_idx, sentence in enumerate(split_sentences(paragraph)):
sent_data = {
"sentence": sent_idx + 1,
"text": sentence,
"clauses": split_clauses(sentence)
}
para_data["sentences"].append(sent_data)
result.append(para_data)
return result
# Example usage
text = """Hello world, this is a test. Another sentence here.
Second paragraph with more content, and some clauses."""
structure = process_document(text)
Simplified Pipeline with split_text
The same can be achieved more simply with split_text():
from phrasplit import split_text
text = """Hello world, this is a test. Another sentence here.
Second paragraph with more content, and some clauses."""
# Get all clauses with full structure information
segments = split_text(text, mode="clause")
for seg in segments:
print(f"P{seg.paragraph} S{seg.sentence}: {seg.text}")