Usage Guide =========== This guide covers how to use phrasplit's Python API for text splitting. Splitting Sentences ------------------- The :func:`~phrasplit.split_sentences` function uses spaCy's NLP pipeline to intelligently detect sentence boundaries: .. code-block:: python from phrasplit import split_sentences text = "Dr. Smith is here. She has a Ph.D. in Chemistry." sentences = split_sentences(text) print(sentences) # ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.'] The function correctly handles: - **Abbreviations**: Mr., Mrs., Dr., Prof., etc. - **Acronyms**: U.S.A., U.K., etc. - **Titles**: Ph.D., M.D., etc. - **URLs**: www.example.com - **Ellipses**: Text with... ellipses Example with abbreviations: .. code-block:: python text = "Mr. Brown met Prof. Green. They discussed the U.S.A. case." sentences = split_sentences(text) # ['Mr. Brown met Prof. Green.', 'They discussed the U.S.A. case.'] Colon Splitting ^^^^^^^^^^^^^^^ By default, colons are treated as sentence terminators. This is useful for news-style text: .. code-block:: python text = "Breaking News: The event has started." sentences = split_sentences(text) # ['Breaking News:', 'The event has started.'] # Disable colon splitting if needed sentences = split_sentences(text, split_on_colon=False) # ['Breaking News: The event has started.'] Splitting Clauses ----------------- The :func:`~phrasplit.split_clauses` function splits text at commas, creating natural pause points ideal for audiobook and text-to-speech applications: .. code-block:: python from phrasplit import split_clauses text = "I like coffee, and I like tea." clauses = split_clauses(text) print(clauses) # ['I like coffee,', 'and I like tea.'] The comma is kept at the end of each clause, preserving the original punctuation. More complex example: .. code-block:: python text = "When the sun rose, the birds began to sing, and the day started." clauses = split_clauses(text) # ['When the sun rose,', 'the birds began to sing,', 'and the day started.'] Splitting Paragraphs -------------------- The :func:`~phrasplit.split_paragraphs` function splits text at double newlines: .. code-block:: python from phrasplit import split_paragraphs text = """First paragraph with some text. Second paragraph with more text. Third paragraph.""" paragraphs = split_paragraphs(text) # ['First paragraph with some text.', # 'Second paragraph with more text.', # 'Third paragraph.'] The function handles multiple blank lines and whitespace-only lines: .. code-block:: python text = "First.\n\n\n\nSecond." # Multiple blank lines paragraphs = split_paragraphs(text) # ['First.', 'Second.'] Hierarchical Splitting with split_text -------------------------------------- The :func:`~phrasplit.split_text` function provides a unified interface for splitting text while preserving paragraph and sentence structure. This is particularly useful for audiobook generation where you need different pause lengths between paragraphs, sentences, and clauses. .. code-block:: python from phrasplit import split_text, Segment text = "First sentence. Second sentence.\n\nNew paragraph here." segments = split_text(text, mode="sentence") for seg in segments: print(f"Paragraph {seg.paragraph}, Sentence {seg.sentence}: {seg.text}") # Paragraph 0, Sentence 0: First sentence. # Paragraph 0, Sentence 1: Second sentence. # Paragraph 1, Sentence 0: New paragraph here. Available Modes ^^^^^^^^^^^^^^^ - ``"paragraph"``: Returns paragraphs only (``sentence`` is None) - ``"sentence"``: Returns sentences with paragraph tracking - ``"clause"``: Returns clauses with paragraph and sentence tracking .. code-block:: python # Paragraph mode segments = split_text(text, mode="paragraph") # Each segment has sentence=None # Sentence mode (default) segments = split_text(text, mode="sentence") # Each segment has paragraph and sentence indices # Clause mode - finest granularity text = "Hello, world. Goodbye, friend." segments = split_text(text, mode="clause") # Returns: Hello, | world. | Goodbye, | friend. # All with paragraph and sentence tracking Detecting Structure Changes ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use the ``Segment`` fields to detect when paragraphs or sentences change: .. code-block:: python from phrasplit import split_text text = "Sent 1. Sent 2.\n\nSent 3." segments = split_text(text, mode="sentence") for i, seg in enumerate(segments): if i > 0 and seg.paragraph != segments[i-1].paragraph: print("--- New Paragraph ---") print(seg.text) Splitting Long Lines -------------------- The :func:`~phrasplit.split_long_lines` function breaks long lines at natural boundaries (sentences and clauses) to fit within a maximum length: .. code-block:: python from phrasplit import split_long_lines text = "This is a very long sentence. This is another sentence that makes it even longer." lines = split_long_lines(text, max_length=40) # Each line will be <= 40 characters when possible The splitting strategy: 1. First, try to split at sentence boundaries 2. If still too long, split at clause boundaries (commas) 3. If still too long, split at word boundaries Using Different Language Models ------------------------------- All functions that use spaCy accept a ``language_model`` parameter: .. code-block:: python from phrasplit import split_sentences # Use a larger, more accurate model sentences = split_sentences(text, language_model="en_core_web_lg") # Use a model for another language sentences = split_sentences(german_text, language_model="de_core_news_sm") Make sure to download the model first: .. code-block:: bash python -m spacy download de_core_news_sm Processing Pipeline Example --------------------------- Here's a complete example of processing a document: .. code-block:: python from phrasplit import split_paragraphs, split_sentences, split_clauses def process_document(text): """Process a document into structured parts.""" result = [] for para_idx, paragraph in enumerate(split_paragraphs(text)): para_data = {"paragraph": para_idx + 1, "sentences": []} for sent_idx, sentence in enumerate(split_sentences(paragraph)): sent_data = { "sentence": sent_idx + 1, "text": sentence, "clauses": split_clauses(sentence) } para_data["sentences"].append(sent_data) result.append(para_data) return result # Example usage text = """Hello world, this is a test. Another sentence here. Second paragraph with more content, and some clauses.""" structure = process_document(text) Simplified Pipeline with split_text ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The same can be achieved more simply with :func:`~phrasplit.split_text`: .. code-block:: python from phrasplit import split_text text = """Hello world, this is a test. Another sentence here. Second paragraph with more content, and some clauses.""" # Get all clauses with full structure information segments = split_text(text, mode="clause") for seg in segments: print(f"P{seg.paragraph} S{seg.sentence}: {seg.text}")