Case Study: Extracting Pdf pages

Organizing Stock Documents for Form 8949 Using Google Colab

During income tax season, one of the most time-consuming tasks is organizing stock-related documents provided by clients. Some clients may have over five stock accounts for both spouses, and others may have a single account with a high volume of transactions.

From these stock documents, the paperwork required for Form 8949 must be identified and compiled into a separate 8949 file. This process can be extremely time-consuming and prone to errors, especially when handling large amounts of data.

Colab Code

To streamline this task and reduce the risk of mistakes, I developed a script using Google Colab. Below is the step-by-step breakdown of how the script works:

Installing Libraries and Importing Files:
First, the necessary libraries are installed, and the stock documents requiring editing are imported into Colab. (See Block 1 and 2 in the code.)
Combining Files with New Page Numbers:
Since multiple stock accounts must be combined into a single file (including accounts for both spouses), the script updates the page numbers in the combined file for easier organization. (See Block 3 in the code.)
Filtering and Extracting Specific Pages:
The script identifies pages that meet specific conditions:
- Pages that must include certain text (included_...).
- Pages that must not include other specific text (excluded_...).
  These filtered pages are compiled into a new file. (See Block 4 in the code.)
Compiling Remaining Pages:
Pages that were not extracted are grouped into a separate “remaining” file.(See Block 5 in the code.)
Validation Check:
To ensure the process was completed accurately, the script checks the “remaining” file to confirm that no required pages were missed.

This script isn’t limited to stock-related documents; it can also be customized for other tasks by modifying the text conditions (highlighted in red in Block 4).

* I want to emphasize that I am not a coding expert, and this code was created specifically to achieve the results I needed. It’s not a professional-level solution, and there may be better methods out there. However, this approach is designed for people who, like me, aren’t well-versed in coding but still want to simplify repetitive tasks.

Block 1

!pip install reportlab PyPDF2

Block 2

from google.colab import files

uploaded = files.upload()

Block 3


import PyPDF2
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from PyPDF2 import PdfReader, PdfWriter
import io

def add_page_numbers(input_pdf_file, output_pdf_file):
    reader = PdfReader(input_pdf_file)
    writer = PdfWriter()

    # Create a new PDF with page numbers
    packet = io.BytesIO()
    c = canvas.Canvas(packet, pagesize=letter)

    # Add page numbers to each page
    for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]

        # Draw the page number at the bottom of each page
        c.drawString(500, 10, f"Page {page_num + 1}")
        c.showPage()  # Finish this page

    c.save()

    # Move the buffer to the beginning
    packet.seek(0)
    new_pdf = PdfReader(packet)

    # Merge the new page number PDF with the original PDF
    for i in range(len(reader.pages)):
        page = reader.pages[i]
        page.merge_page(new_pdf.pages[i])

        writer.add_page(page)

    # Save the resulting PDF with page numbers
    with open(output_pdf_file, "wb") as output:
        writer.write(output)

    print(f"Page numbers added. New PDF saved as {output_pdf_file}")

# Example usage
input_pdf_file = next(iter(uploaded))  # Use the uploaded file
output_pdf_file_with_numbers = 'output_with_page_numbers.pdf'

# Add page numbers
add_page_numbers(input_pdf_file, output_pdf_file_with_numbers)    #########################

files.download(output_pdf_file_with_numbers)

Block 4

def extract_pages_with_characters(pdf_file, included_chars, excluded_chars):
    with open(pdf_file, 'rb') as file:
        pdf = PyPDF2.PdfReader(file)
        num_pages = len(pdf.pages)

        pages_with_included_chars = []
        pages_without_excluded_chars = []

        for page_num in range(num_pages):
            page = pdf.pages[page_num]
            text = page.extract_text()

            # Check for inclusion criteria
            if any(char in text for char in included_chars):
                pages_with_included_chars.append(page_num + 1)  # Page numbers start from 1, not 0

            # Check for exclusion criteria
            if not any(char in text for char in excluded_chars):
                pages_without_excluded_chars.append(page_num + 1)

        return list(set(pages_with_included_chars) & set(pages_without_excluded_chars))

def extract_pages_to_new_pdf(pdf_file, pages_to_extract, output_file):
    with open(pdf_file, 'rb') as file:
        pdf = PyPDF2.PdfReader(file)
        pdf_writer = PyPDF2.PdfWriter()
        # Sort the pages before adding them to the PdfWriter
        sorted_pages = sorted(pages_to_extract)

        for page_num in sorted_pages:
            pdf_writer.add_page(pdf.pages[page_num - 1])  # Adjust index for 0-based pages

        with open(output_file, 'wb') as output:
            pdf_writer.write(output)



with open(input_pdf_file, 'rb') as f:
    content = f.read()

output_pdf_file = 'Form8949ExceptionReportingStatement.pdf'


included_characters = ['8949,','8949','Form 8949,','Form 8949','Form 1099-MISC','Tax Summary'] #,'Form 1099-B'] #'1099-B']  # Characters to include
excluded_characters = ['1099-INT', '1099-DIV','1099-INT*', '1099-DIV*', '1099-C', 'INSTRUCTIONS', 'Instructions']  # Characters to exclude

pages_to_extract = extract_pages_with_characters(input_pdf_file, included_characters, excluded_characters)

print("pages",pages_to_extract )
if pages_to_extract:
    extract_pages_to_new_pdf(input_pdf_file, pages_to_extract, output_pdf_file)
    print(f"Pages with included characters and without excluded characters have been extracted to '{output_pdf_file}'.")
else:
    print("No pages found matching the criteria.")

# Trigger the download
files.download(output_pdf_file)

Block 5

input_pdf_file = output_pdf_file_with_numbers
excluded_pages = pages_to_extract

# 지정된 페이지를 제외한 나머지 페이지를 추출하는 함수
def extract_other_pages(pdf_file, excluded_pages, output_file):
    with open(pdf_file, 'rb') as file:
        pdf = PyPDF2.PdfReader(file)
        pdf_writer = PyPDF2.PdfWriter()

        num_pages = len(pdf.pages)
        all_pages = set(range(1, num_pages + 1))  # 모든 페이지 번호 (1-based index)
        included_pages = sorted(all_pages - set(excluded_pages))  # 제외된 페이지를 뺀 나머지

        for page_num in included_pages:
            pdf_writer.add_page(pdf.pages[page_num - 1])  # 0-based index

        with open(output_file, 'wb') as output:
            pdf_writer.write(output)

    return included_pages


# 제외할 페이지 리스트

# 제외되지 않은 페이지를 추출하여 저장
output_pdf_file = 'remaining_pages.pdf'
remaining_pages = extract_other_pages(input_pdf_file, excluded_pages, output_pdf_file)

# 결과 출력 및 다운로드
print("Remaining Pages: ", remaining_pages)
if remaining_pages:
    print(f"Pages excluding {excluded_pages} have been saved to '{output_pdf_file}'.")
    files.download(output_pdf_file)
else:
    print("No pages remaining after excluding the specified pages.")

Youtube

You can find a detailed explanation and step-by-step guide in the YouTube video linked below.

Case Study: Extracting Pdf pages

Organizing Stock Documents for Form 8949 Using Google Colab

Colab Code

Youtube

Case Study: Editing Pdf Pages – Split a Single Page into Three Pages

Case Study: Bookkeeping

Case Study: Send Bulk Texts to Customers Monthly with Bulk Texter Pro

Case Study: Sending Bulk BOI Notice Emails

Case Study: Send bulk 2024 Income Tax Notice Emails

Organizing Stock Documents for Form 8949 Using Google Colab

Colab Code

Youtube

Similar Posts

Youtube