Organizing Stock Documents for Form 8949 Using Google Colab
During income tax season, one of the most time-consuming tasks is organizing stock-related documents provided by clients. Some clients may have over five stock accounts for both spouses, and others may have a single account with a high volume of transactions.
From these stock documents, the paperwork required for Form 8949 must be identified and compiled into a separate 8949 file. This process can be extremely time-consuming and prone to errors, especially when handling large amounts of data.
Colab Code
To streamline this task and reduce the risk of mistakes, I developed a script using Google Colab. Below is the step-by-step breakdown of how the script works:
- Installing Libraries and Importing Files:
First, the necessary libraries are installed, and the stock documents requiring editing are imported into Colab. (See Block 1 and 2 in the code.) - Combining Files with New Page Numbers:
Since multiple stock accounts must be combined into a single file (including accounts for both spouses), the script updates the page numbers in the combined file for easier organization. (See Block 3 in the code.) - Filtering and Extracting Specific Pages:
The script identifies pages that meet specific conditions:- Pages that must include certain text (
included_...
). - Pages that must not include other specific text (
excluded_...
).
These filtered pages are compiled into a new file. (See Block 4 in the code.)
- Pages that must include certain text (
- Compiling Remaining Pages:
Pages that were not extracted are grouped into a separate “remaining” file.(See Block 5 in the code.) - Validation Check:
To ensure the process was completed accurately, the script checks the “remaining” file to confirm that no required pages were missed.
This script isn’t limited to stock-related documents; it can also be customized for other tasks by modifying the text conditions (highlighted in red in Block 4).
* I want to emphasize that I am not a coding expert, and this code was created specifically to achieve the results I needed. It’s not a professional-level solution, and there may be better methods out there. However, this approach is designed for people who, like me, aren’t well-versed in coding but still want to simplify repetitive tasks.
Block 1
!pip install reportlab PyPDF2
Block 2
from google.colab import files
uploaded = files.upload()
Block 3
import PyPDF2
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from PyPDF2 import PdfReader, PdfWriter
import io
def add_page_numbers(input_pdf_file, output_pdf_file):
reader = PdfReader(input_pdf_file)
writer = PdfWriter()
# Create a new PDF with page numbers
packet = io.BytesIO()
c = canvas.Canvas(packet, pagesize=letter)
# Add page numbers to each page
for page_num in range(len(reader.pages)):
page = reader.pages[page_num]
# Draw the page number at the bottom of each page
c.drawString(500, 10, f"Page {page_num + 1}")
c.showPage() # Finish this page
c.save()
# Move the buffer to the beginning
packet.seek(0)
new_pdf = PdfReader(packet)
# Merge the new page number PDF with the original PDF
for i in range(len(reader.pages)):
page = reader.pages[i]
page.merge_page(new_pdf.pages[i])
writer.add_page(page)
# Save the resulting PDF with page numbers
with open(output_pdf_file, "wb") as output:
writer.write(output)
print(f"Page numbers added. New PDF saved as {output_pdf_file}")
# Example usage
input_pdf_file = next(iter(uploaded)) # Use the uploaded file
output_pdf_file_with_numbers = 'output_with_page_numbers.pdf'
# Add page numbers
add_page_numbers(input_pdf_file, output_pdf_file_with_numbers) #########################
files.download(output_pdf_file_with_numbers)
Block 4
def extract_pages_with_characters(pdf_file, included_chars, excluded_chars):
with open(pdf_file, 'rb') as file:
pdf = PyPDF2.PdfReader(file)
num_pages = len(pdf.pages)
pages_with_included_chars = []
pages_without_excluded_chars = []
for page_num in range(num_pages):
page = pdf.pages[page_num]
text = page.extract_text()
# Check for inclusion criteria
if any(char in text for char in included_chars):
pages_with_included_chars.append(page_num + 1) # Page numbers start from 1, not 0
# Check for exclusion criteria
if not any(char in text for char in excluded_chars):
pages_without_excluded_chars.append(page_num + 1)
return list(set(pages_with_included_chars) & set(pages_without_excluded_chars))
def extract_pages_to_new_pdf(pdf_file, pages_to_extract, output_file):
with open(pdf_file, 'rb') as file:
pdf = PyPDF2.PdfReader(file)
pdf_writer = PyPDF2.PdfWriter()
# Sort the pages before adding them to the PdfWriter
sorted_pages = sorted(pages_to_extract)
for page_num in sorted_pages:
pdf_writer.add_page(pdf.pages[page_num - 1]) # Adjust index for 0-based pages
with open(output_file, 'wb') as output:
pdf_writer.write(output)
with open(input_pdf_file, 'rb') as f:
content = f.read()
output_pdf_file = 'Form8949ExceptionReportingStatement.pdf'
included_characters = ['8949,','8949','Form 8949,','Form 8949','Form 1099-MISC','Tax Summary'] #,'Form 1099-B'] #'1099-B'] # Characters to include
excluded_characters = ['1099-INT', '1099-DIV','1099-INT*', '1099-DIV*', '1099-C', 'INSTRUCTIONS', 'Instructions'] # Characters to exclude
pages_to_extract = extract_pages_with_characters(input_pdf_file, included_characters, excluded_characters)
print("pages",pages_to_extract )
if pages_to_extract:
extract_pages_to_new_pdf(input_pdf_file, pages_to_extract, output_pdf_file)
print(f"Pages with included characters and without excluded characters have been extracted to '{output_pdf_file}'.")
else:
print("No pages found matching the criteria.")
# Trigger the download
files.download(output_pdf_file)
Block 5
input_pdf_file = output_pdf_file_with_numbers
excluded_pages = pages_to_extract
# 지정된 페이지를 제외한 나머지 페이지를 추출하는 함수
def extract_other_pages(pdf_file, excluded_pages, output_file):
with open(pdf_file, 'rb') as file:
pdf = PyPDF2.PdfReader(file)
pdf_writer = PyPDF2.PdfWriter()
num_pages = len(pdf.pages)
all_pages = set(range(1, num_pages + 1)) # 모든 페이지 번호 (1-based index)
included_pages = sorted(all_pages - set(excluded_pages)) # 제외된 페이지를 뺀 나머지
for page_num in included_pages:
pdf_writer.add_page(pdf.pages[page_num - 1]) # 0-based index
with open(output_file, 'wb') as output:
pdf_writer.write(output)
return included_pages
# 제외할 페이지 리스트
# 제외되지 않은 페이지를 추출하여 저장
output_pdf_file = 'remaining_pages.pdf'
remaining_pages = extract_other_pages(input_pdf_file, excluded_pages, output_pdf_file)
# 결과 출력 및 다운로드
print("Remaining Pages: ", remaining_pages)
if remaining_pages:
print(f"Pages excluding {excluded_pages} have been saved to '{output_pdf_file}'.")
files.download(output_pdf_file)
else:
print("No pages remaining after excluding the specified pages.")
Youtube
You can find a detailed explanation and step-by-step guide in the YouTube video linked below.