API Reference

This page contains the complete API reference for phrasplit.

Main Functions

split_sentences

phrasplit.split_sentences(text: str, language_model: str = 'en_core_web_sm', apply_corrections: bool = True, split_on_colon: bool = True) -> list[str]

Split text into sentences using spaCy.

Args:

text: Input text

language_model: spaCy language model to use

apply_corrections: Whether to apply post-processing corrections for common spaCy errors (URL splitting, abbreviation handling). Default is True.

split_on_colon: Kept for API compatibility (currently unused). spaCy’s default colon behavior is used. Default is True.

Returns:

List of sentences

Example:

from phrasplit import split_sentences

text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
sentences = split_sentences(text)
# ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.']

# split_on_colon is accepted for API compatibility but currently has no effect
text = "Note: This is important."
sentences = split_sentences(text, split_on_colon=False)
# ['Note: This is important.']

split_clauses

phrasplit.split_clauses(text: str, language_model: str = 'en_core_web_sm') -> list[str]

Split text into comma-separated parts for audiobook creation.

Uses spaCy for sentence detection, then splits each sentence at commas. The comma stays at the end of each part, creating natural pause points for text-to-speech processing.

Args:

text: Input text

language_model: spaCy language model to use

Returns:

List of comma-separated parts

Example:

Input: "I do like coffee, and I like wine."

Output: ["I do like coffee,", "and I like wine."]

Example:

from phrasplit import split_clauses

text = "I like coffee, and I like tea."
clauses = split_clauses(text)
# ['I like coffee,', 'and I like tea.']
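
The comma-handling step can be illustrated without spaCy. The following is a minimal sketch, not phrasplit's implementation: `split_at_commas` is a hypothetical helper that takes a single, already-detected sentence and splits it after each comma, keeping the comma attached to the left-hand part.

```python
import re


def split_at_commas(sentence: str) -> list[str]:
    """Split one sentence after each comma, keeping the comma with the left part.

    Hypothetical helper for illustration; sentence detection is assumed
    to have happened already (phrasplit uses spaCy for that step).
    """
    # Split at whitespace that directly follows a comma. The lookbehind
    # leaves the comma in place at the end of the preceding part.
    parts = re.split(r"(?<=,)\s+", sentence.strip())
    return [part for part in parts if part]


print(split_at_commas("I like coffee, and I like tea."))
# ['I like coffee,', 'and I like tea.']
```

Because the pattern requires whitespace after the comma, numbers such as "1,000" are left intact in this sketch.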

split_paragraphs

phrasplit.split_paragraphs(text: str) -> list[str]

Split text into paragraphs (separated by double newlines).

Applies preprocessing to fix hyphenated line breaks and normalize whitespace.

Args:

text: Input text

Returns:

List of paragraphs (non-empty, stripped)

Example:

from phrasplit import split_paragraphs

text = "First paragraph.\n\nSecond paragraph."
paragraphs = split_paragraphs(text)
# ['First paragraph.', 'Second paragraph.']
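
To make the preprocessing concrete, here is a rough standard-library sketch of the three steps described above: rejoining words hyphenated across a line break, splitting on blank lines, and normalizing whitespace. The function name `sketch_split_paragraphs` and the exact regular expressions are assumptions for illustration; the library's own rules may differ.

```python
import re


def sketch_split_paragraphs(text: str) -> list[str]:
    """Illustrative stand-in for paragraph splitting (not phrasplit's code)."""
    # Rejoin words hyphenated across a line break: "para-\ngraph" -> "paragraph".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Split on blank lines: two newlines, optionally with whitespace between.
    chunks = re.split(r"\n\s*\n", text)
    # Collapse whitespace runs inside each paragraph and strip the ends.
    paragraphs = [re.sub(r"\s+", " ", chunk).strip() for chunk in chunks]
    # Drop paragraphs that ended up empty.
    return [p for p in paragraphs if p]


text = "First para-\ngraph spans\ntwo lines.\n\n\nSecond paragraph.  "
print(sketch_split_paragraphs(text))
# ['First paragraph spans two lines.', 'Second paragraph.']
```

Note that this naive rule also joins genuine compounds that happen to break at the hyphen (for example "well-\nknown" becomes "wellknown").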

split_text

phrasplit.split_text(text: str, mode: str = 'sentence', language_model: str = 'en_core_web_sm', apply_corrections: bool = True, split_on_colon: bool = True) -> list[Segment]

Split text into segments with hierarchical position information.

This function provides a unified interface for text splitting with different granularity levels, while preserving paragraph and sentence structure information. Useful for audiobook generation where different pause lengths are needed between paragraphs vs. sentences vs. clauses.

Args:

text: Input text to split

mode: Splitting mode, one of:

  • “paragraph”: Split into paragraphs only

  • “sentence”: Split into sentences, grouped by paragraph

  • “clause”: Split into clauses (comma-separated), with paragraph and sentence info

language_model: spaCy language model to use (for sentence/clause modes)

apply_corrections: Whether to apply post-processing corrections for common spaCy errors (URL splitting, abbreviation handling). Default is True. Only applies to sentence/clause modes.

split_on_colon: Kept for API compatibility (currently unused). spaCy’s default colon behavior is used. Default is True.

Returns:
List of Segment namedtuples, each containing:
  • text: The segment text

  • paragraph: Paragraph index (0-based)

  • sentence: Sentence index within paragraph (0-based). None for paragraph mode.

Raises:

ValueError: If mode is not one of “paragraph”, “sentence”, “clause”

Example:
>>> segments = split_text("Hello world. How are you?\n\nNew paragraph.")
>>> for seg in segments:
...     print(f"P{seg.paragraph} S{seg.sentence}: {seg.text}")
P0 S0: Hello world.
P0 S1: How are you?
P1 S0: New paragraph.
>>> # Detect paragraph changes for longer pauses
>>> for i, seg in enumerate(segments):
...     if i > 0 and seg.paragraph != segments[i-1].paragraph:
...         print("--- paragraph break ---")
...     print(seg.text)

Example:

from phrasplit import split_text, Segment

text = "First sentence. Second sentence.\n\nNew paragraph."
segments = split_text(text, mode="sentence")

for seg in segments:
    print(f"P{seg.paragraph} S{seg.sentence}: {seg.text}")
# P0 S0: First sentence.
# P0 S1: Second sentence.
# P1 S0: New paragraph.

# Clause mode for finer granularity
text = "Hello, world.\n\nGoodbye, friend."
segments = split_text(text, mode="clause")
# Returns clauses with paragraph and sentence indices
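
The audiobook use case mentioned above (different pause lengths between paragraphs, sentences, and clauses) can be sketched by comparing the indices of consecutive segments. The example below uses a local stand-in for `Segment` and hand-built clause-mode segments so that it runs without spaCy; the pause constants and the `pause_before` helper are hypothetical, and it assumes clauses of the same sentence share a sentence index.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    # Local stand-in mirroring phrasplit.Segment so the sketch runs without spaCy.
    text: str
    paragraph: int
    sentence: Optional[int] = None


# Hypothetical pause lengths in seconds; tune these for your TTS engine.
PARAGRAPH_PAUSE = 1.0
SENTENCE_PAUSE = 0.5
CLAUSE_PAUSE = 0.2


def pause_before(prev: Optional[Segment], cur: Segment) -> float:
    """Return the pause to insert before `cur`, given the previous segment."""
    if prev is None:
        return 0.0  # nothing precedes the first segment
    if cur.paragraph != prev.paragraph:
        return PARAGRAPH_PAUSE
    if cur.sentence != prev.sentence:
        return SENTENCE_PAUSE
    return CLAUSE_PAUSE  # same paragraph and sentence: a clause boundary


# Hand-built segments shaped like clause-mode output (assumed for illustration).
segments = [
    Segment("Hello,", 0, 0),
    Segment("world.", 0, 0),
    Segment("How are you?", 0, 1),
    Segment("Goodbye.", 1, 0),
]

pauses = [pause_before(p, c) for p, c in zip([None] + segments[:-1], segments)]
print(pauses)
# [0.0, 0.2, 0.5, 1.0]
```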

split_long_lines

phrasplit.split_long_lines(text: str, max_length: int, language_model: str = 'en_core_web_sm') -> list[str]

Split lines exceeding max_length at clause/sentence boundaries.

Strategy:

1. First try to split at sentence boundaries
2. If still too long, split at clause boundaries (commas, semicolons, etc.)
3. If still too long, split at word boundaries

Args:

text: Input text

max_length: Maximum line length in characters (must be positive)

language_model: spaCy language model to use

Returns:

List of lines, each within max_length (except single words exceeding limit)

Raises:

ValueError: If max_length is less than 1

Example:

from phrasplit import split_long_lines

text = "This is a very long sentence that needs to be split into smaller parts."
lines = split_long_lines(text, max_length=40)
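
The coarse-to-fine fallback described in the strategy can be sketched with regular expressions. This is not the library's implementation: phrasplit detects sentences with spaCy, whereas the sketch below substitutes simple punctuation patterns, and the names `BOUNDARIES`, `_pack`, and `sketch_split_long` are hypothetical.

```python
import re

# Boundary patterns from coarse to fine: sentence end, clause punctuation, any whitespace.
BOUNDARIES = [r"(?<=[.!?])\s+", r"(?<=[,;:])\s+", r"\s+"]


def _pack(pieces: list[str], max_length: int) -> list[str]:
    """Greedily join consecutive pieces with spaces while staying within max_length."""
    lines: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_length:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = piece  # may still be too long; handled at a finer level
    if current:
        lines.append(current)
    return lines


def sketch_split_long(text: str, max_length: int, level: int = 0) -> list[str]:
    """Split text into lines of at most max_length, trying coarse boundaries first."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    text = text.strip()
    # Stop when the text fits or no finer boundary is left (a single long word).
    if len(text) <= max_length or level == len(BOUNDARIES):
        return [text] if text else []
    pieces = [p for p in re.split(BOUNDARIES[level], text) if p]
    lines: list[str] = []
    for chunk in _pack(pieces, max_length):
        # Chunks that are still too long are retried at the next, finer level.
        lines.extend(sketch_split_long(chunk, max_length, level + 1))
    return lines


text = "This is a very long sentence that needs to be split into smaller parts."
print(sketch_split_long(text, max_length=40))
# ['This is a very long sentence that needs', 'to be split into smaller parts.']
```

As in the documented behavior, a single word longer than `max_length` is returned unsplit.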

Data Types

Segment

class phrasplit.Segment(text: str, paragraph: int, sentence: int | None = None)

Bases: NamedTuple

A text segment with position information.

Attributes:

text: The text content of the segment

paragraph: Paragraph index (0-based) within the document

sentence: Sentence index (0-based) within the paragraph. None for paragraph mode.

text: str

Alias for field number 0

paragraph: int

Alias for field number 1

sentence: int | None

Alias for field number 2

A named tuple representing a text segment with position information.

Fields:

  • text (str): The text content of the segment

  • paragraph (int): Paragraph index (0-based) within the document

  • sentence (int | None): Sentence index (0-based) within the paragraph. None for paragraph mode.

Example:

from phrasplit import split_text, Segment

segments = split_text("Hello world.", mode="sentence")
seg = segments[0]

# Access by name
print(seg.text)       # "Hello world."
print(seg.paragraph)  # 0
print(seg.sentence)   # 0

# Access by index
print(seg[0])  # "Hello world."
print(seg[1])  # 0
print(seg[2])  # 0

# Unpack
text, para, sent = seg
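
Because each segment carries its paragraph index, segments can be regrouped after splitting. The sketch below uses a local stand-in for `Segment` so it runs without spaCy, and a hypothetical helper `group_by_paragraph`; it assumes the segments are in document order, so that segments of one paragraph are adjacent.

```python
from itertools import groupby
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    # Local stand-in mirroring phrasplit.Segment so the sketch runs without spaCy.
    text: str
    paragraph: int
    sentence: Optional[int] = None


def group_by_paragraph(segments: list[Segment]) -> list[list[str]]:
    """Collect segment texts into one list per paragraph, preserving order."""
    # groupby only merges adjacent items, so document order is required.
    return [
        [seg.text for seg in group]
        for _, group in groupby(segments, key=lambda seg: seg.paragraph)
    ]


segments = [
    Segment("First sentence.", 0, 0),
    Segment("Second sentence.", 0, 1),
    Segment("New paragraph.", 1, 0),
]
print(group_by_paragraph(segments))
# [['First sentence.', 'Second sentence.'], ['New paragraph.']]
```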

Module Contents

splitter module

Text splitting utilities using spaCy for NLP-based sentence and clause detection.

class phrasplit.splitter.Segment(text: str, paragraph: int, sentence: int | None = None)

Bases: NamedTuple

A text segment with position information.

Attributes:

text: The text content of the segment

paragraph: Paragraph index (0-based) within the document

sentence: Sentence index (0-based) within the paragraph. None for paragraph mode.

text: str

Alias for field number 0

paragraph: int

Alias for field number 1

sentence: int | None

Alias for field number 2

phrasplit.splitter.split_paragraphs(text: str) -> list[str]

Split text into paragraphs (separated by double newlines).

Applies preprocessing to fix hyphenated line breaks and normalize whitespace.

Args:

text: Input text

Returns:

List of paragraphs (non-empty, stripped)

phrasplit.splitter.split_sentences(text: str, language_model: str = 'en_core_web_sm', apply_corrections: bool = True, split_on_colon: bool = True) -> list[str]

Split text into sentences using spaCy.

Args:

text: Input text

language_model: spaCy language model to use

apply_corrections: Whether to apply post-processing corrections for common spaCy errors (URL splitting, abbreviation handling). Default is True.

split_on_colon: Kept for API compatibility (currently unused). spaCy’s default colon behavior is used. Default is True.

Returns:

List of sentences

phrasplit.splitter.split_clauses(text: str, language_model: str = 'en_core_web_sm') -> list[str]

Split text into comma-separated parts for audiobook creation.

Uses spaCy for sentence detection, then splits each sentence at commas. The comma stays at the end of each part, creating natural pause points for text-to-speech processing.

Args:

text: Input text

language_model: spaCy language model to use

Returns:

List of comma-separated parts

Example:

Input: "I do like coffee, and I like wine."

Output: ["I do like coffee,", "and I like wine."]

phrasplit.splitter.split_long_lines(text: str, max_length: int, language_model: str = 'en_core_web_sm') -> list[str]

Split lines exceeding max_length at clause/sentence boundaries.

Strategy:

1. First try to split at sentence boundaries
2. If still too long, split at clause boundaries (commas, semicolons, etc.)
3. If still too long, split at word boundaries

Args:

text: Input text

max_length: Maximum line length in characters (must be positive)

language_model: spaCy language model to use

Returns:

List of lines, each within max_length (except single words exceeding limit)

Raises:

ValueError: If max_length is less than 1

phrasplit.splitter.split_text(text: str, mode: str = 'sentence', language_model: str = 'en_core_web_sm', apply_corrections: bool = True, split_on_colon: bool = True) -> list[Segment]

Split text into segments with hierarchical position information.

This function provides a unified interface for text splitting with different granularity levels, while preserving paragraph and sentence structure information. Useful for audiobook generation where different pause lengths are needed between paragraphs vs. sentences vs. clauses.

Args:

text: Input text to split

mode: Splitting mode, one of:

  • “paragraph”: Split into paragraphs only

  • “sentence”: Split into sentences, grouped by paragraph

  • “clause”: Split into clauses (comma-separated), with paragraph and sentence info

language_model: spaCy language model to use (for sentence/clause modes)

apply_corrections: Whether to apply post-processing corrections for common spaCy errors (URL splitting, abbreviation handling). Default is True. Only applies to sentence/clause modes.

split_on_colon: Kept for API compatibility (currently unused). spaCy’s default colon behavior is used. Default is True.

Returns:
List of Segment namedtuples, each containing:
  • text: The segment text

  • paragraph: Paragraph index (0-based)

  • sentence: Sentence index within paragraph (0-based). None for paragraph mode.

Raises:

ValueError: If mode is not one of “paragraph”, “sentence”, “clause”

Example:
>>> segments = split_text("Hello world. How are you?\n\nNew paragraph.")
>>> for seg in segments:
...     print(f"P{seg.paragraph} S{seg.sentence}: {seg.text}")
P0 S0: Hello world.
P0 S1: How are you?
P1 S0: New paragraph.
>>> # Detect paragraph changes for longer pauses
>>> for i, seg in enumerate(segments):
...     if i > 0 and seg.paragraph != segments[i-1].paragraph:
...         print("--- paragraph break ---")
...     print(seg.text)

Type Information

phrasplit is fully typed and includes a py.typed marker file for PEP 561 compliance. You can use it with mypy and other type checkers.

Function signatures:

from typing import NamedTuple

class Segment(NamedTuple):
    text: str
    paragraph: int
    sentence: int | None = None

def split_sentences(
    text: str,
    language_model: str = "en_core_web_sm",
    apply_corrections: bool = True,
    split_on_colon: bool = True,
) -> list[str]: ...

def split_clauses(
    text: str,
    language_model: str = "en_core_web_sm",
) -> list[str]: ...

def split_paragraphs(text: str) -> list[str]: ...

def split_text(
    text: str,
    mode: str = "sentence",
    language_model: str = "en_core_web_sm",
    apply_corrections: bool = True,
    split_on_colon: bool = True,
) -> list[Segment]: ...

def split_long_lines(
    text: str,
    max_length: int,
    language_model: str = "en_core_web_sm",
) -> list[str]: ...
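
Since `mode` is typed as a plain `str`, a type checker cannot flag a misspelled mode; the error only surfaces as a `ValueError` at call time. One way to catch this earlier in your own code is a `Literal` alias with a small validator. The sketch below is an assumption-laden illustration: `SplitMode` and `validate_mode` are hypothetical names that phrasplit does not export.

```python
from typing import Literal, cast, get_args

# Hypothetical alias listing the documented modes; not part of phrasplit.
SplitMode = Literal["paragraph", "sentence", "clause"]


def validate_mode(mode: str) -> SplitMode:
    """Check a mode string against the documented values before calling split_text."""
    if mode not in get_args(SplitMode):
        allowed = ", ".join(get_args(SplitMode))
        raise ValueError(f"mode must be one of: {allowed}; got {mode!r}")
    # The membership check above justifies narrowing str to the Literal type.
    return cast(SplitMode, mode)


print(validate_mode("clause"))
# clause
```

Code that accepts a `SplitMode` parameter then lets mypy reject string literals outside the three documented values.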