PDF Import Functions

Overview

The PDF import functions extract text from PDF documents while preserving document structure. The package provides automatic section detection, supports multiple column layouts, and now includes AI-enhanced extraction through Google’s Gemini API for handling complex document layouts.

🆕 AI-Enhanced Extraction

The package now supports AI-assisted PDF text extraction for improved accuracy with complex layouts. This feature uses Google’s Gemini API to better parse:

  • Multi-column layouts with complex flow
  • Documents with tables and figures
  • Papers with unusual formatting
  • Scanned or low-quality PDFs

To use AI features, you need a Gemini API key from Google AI Studio.
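
For example, the key can be stored once in .Renviron so every session picks it up automatically (GEMINI_API_KEY is the environment variable the package reads; usethis is an optional convenience here, not a package requirement):

# One-off setup: open .Renviron and add the line
#   GEMINI_API_KEY=your-api-key-here
usethis::edit_r_environ()

# Or set the key for the current session only
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")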

Main Function

pdf2txt_auto()

Import PDF with automatic section detection and multi-column support. Now includes optional AI enhancement.

Usage

pdf2txt_auto(
  file_path,
  n_columns = 1,
  sections = TRUE,
  section_keywords = NULL,
  use_ai = FALSE,        # New: Enable AI processing
  ai_model = "2.0-flash" # New: Gemini model version
)

Arguments

  • file_path: Character string. Path to the PDF file
  • n_columns: Integer. Number of columns in the PDF layout (1, 2, or 3)
  • sections: Logical. Whether to split text into sections
  • section_keywords: Named list of character vectors. Custom keywords for section detection (see Custom Section Keywords below)
  • use_ai: Logical. Whether to use AI-enhanced extraction (default: FALSE) 🆕
  • ai_model: Character. Gemini model version to use (default: “2.0-flash”) 🆕

Value

A named list containing:

  • Full_text: The complete document text
  • Section-specific text (e.g., Abstract, Introduction, Methods, etc.)

New AI Functions 🆕

gemini_content_ai()

Process PDFs and images with Google Gemini AI for enhanced content extraction.

Usage

gemini_content_ai(
  image = NULL,
  docs = NULL,
  prompt = "Explain these images",
  model = "2.0-flash",
  image_type = "png",
  retry_503 = 5,
  api_key = NULL,
  outputSize = "medium"
)

Arguments

  • image: Character vector. Path(s) to image file(s)
  • docs: Character vector. Path(s) to document file(s) (PDF, TXT, HTML, CSV, RTF)
  • prompt: Character. The prompt/instruction for the AI model
  • model: Character. Gemini model version (“1.5-flash”, “2.0-flash”, “2.5-flash”)
  • image_type: Character. Image MIME type (default: “png”)
  • retry_503: Integer. Retry attempts for HTTP 503 errors (default: 5)
  • api_key: Character. Gemini API key (uses GEMINI_API_KEY env var if NULL)
  • outputSize: Character. Output token limit: “small” (8K), “medium” (16K), “large” (32K), “huge” (131K)

Value

Character vector containing the AI-generated response(s).
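
For instance, the same function can caption a figure directly (a minimal sketch; the file name is illustrative):

# Describe a single figure (PNG is the image_type default)
caption <- gemini_content_ai(
  image = "figure1.png",
  prompt = "Describe this figure and any trends it shows",
  model = "2.0-flash"
)
cat(caption)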

process_large_pdf() 🆕

Process large PDFs in chunks using AI for better handling of lengthy documents.

Usage

process_large_pdf(
  file_path,
  chunk_pages = 10,
  ai_model = "2.0-flash",
  prompt = "Extract and structure the text from this document",
  api_key = NULL
)

Arguments

  • file_path: Character string. Path to the PDF file
  • chunk_pages: Integer. Number of pages to process per chunk (default: 10)
  • ai_model: Character. Gemini model version to use
  • prompt: Character. Instruction for the AI model
  • api_key: Character. Gemini API key

Value

A named list with text extracted from all chunks, merged together.
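
A quick way to inspect and flatten the result (assuming each element of the returned list is a character string, per the Value description above):

large_doc <- process_large_pdf("long_paper.pdf", chunk_pages = 10)

# Inspect the element names
names(large_doc)

# Collapse everything into one string for downstream analysis
full_text <- paste(unlist(large_doc), collapse = "\n\n")
nchar(full_text)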

Examples

Basic Import

Import a simple single-column PDF:

library(contentanalysis)

# Single column PDF
doc <- pdf2txt_auto("paper.pdf", n_columns = 1)

# Check structure
names(doc)
str(doc, max.level = 1)

Multi-Column PDFs

Most academic papers use two-column layouts:

# Two-column layout (most common)
doc <- pdf2txt_auto("paper.pdf", n_columns = 2)

# Three-column layout (less common)
doc_three <- pdf2txt_auto("paper.pdf", n_columns = 3)

Section Detection

The function automatically detects common academic sections:

# With automatic section detection (default)
doc <- pdf2txt_auto("paper.pdf", n_columns = 2, sections = TRUE)

# View detected sections
names(doc)
# [1] "Full_text"    "Abstract"     "Introduction" "Methods"     
# [5] "Results"      "Discussion"   "References"

# Access specific sections
cat(doc$Abstract)
cat(doc$Introduction)

Without Section Splitting

Extract text without section detection:

# Get only full text
text_only <- pdf2txt_auto("paper.pdf", sections = FALSE)

# Result is a list with only Full_text
names(text_only)
# [1] "Full_text"

AI-Enhanced Import 🆕

Use AI for complex or difficult PDFs:

# Set API key (if not in .Renviron)
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")

# Basic AI-enhanced extraction
doc_ai <- pdf2txt_auto(
  "complex_paper.pdf",
  n_columns = 2,
  use_ai = TRUE,
  ai_model = "2.0-flash"
)

# Process a large PDF in chunks
large_doc <- process_large_pdf(
  "long_paper.pdf",
  chunk_pages = 10,
  ai_model = "2.0-flash"
)

# Direct AI extraction with custom prompt
result <- gemini_content_ai(
  docs = "paper.pdf",
  prompt = "Extract all section headings and their content from this paper",
  outputSize = "large"
)

# Process multiple documents
results <- gemini_content_ai(
  docs = c("paper1.pdf", "paper2.pdf"),
  prompt = "Summarize the main findings",
  model = "2.0-flash"
)

Custom Section Keywords

Define custom keywords for section detection:

# Custom section keywords
my_keywords <- list(
  Background = c("background", "literature review"),
  Methodology = c("methodology", "experimental design"),
  Findings = c("findings", "observations"),
  Conclusions = c("conclusions", "summary")
)

doc <- pdf2txt_auto(
  "paper.pdf",
  n_columns = 2,
  sections = TRUE,
  section_keywords = my_keywords
)

names(doc)

Working with Results

Accessing Section Content

# Get word count per section
section_lengths <- sapply(doc[names(doc) != "Full_text"], function(x) {
  length(strsplit(x, "\\s+")[[1]])
})

print(section_lengths)

# Example output:
#    Abstract Introduction      Methods      Results   Discussion   References 
#         234         1052          876         1342          945          523
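
To eyeball the balance across sections, a quick base-R bar chart of the same counts works:

# Visualize relative section lengths
barplot(sort(section_lengths, decreasing = TRUE),
        las = 2, ylab = "Words", main = "Words per section")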

Checking Section Quality

# Check if important sections were detected
required_sections <- c("Abstract", "Introduction", "Methods", "Results")
detected <- required_sections %in% names(doc)
names(detected) <- required_sections

print(detected)

# Identify missing sections
missing <- required_sections[!detected]
if (length(missing) > 0) {
  cat("Warning: Missing sections:", paste(missing, collapse = ", "), "\n")
}

Preview Section Content

# Preview first 500 characters of each section
preview_sections <- function(doc, n_chars = 500) {
  sections <- names(doc)[names(doc) != "Full_text"]
  
  for (section in sections) {
    cat("\n===", section, "===\n")
    cat(substr(doc[[section]], 1, n_chars), "...\n")
  }
}

preview_sections(doc)

Common Use Cases

Use Case 1: Prepare for Analysis

# Import PDF
doc <- pdf2txt_auto("example_paper.pdf", n_columns = 2)

# Verify sections were detected
if (!"Introduction" %in% names(doc)) {
  warning("Introduction not detected - check section keywords")
}

# Pass to analysis
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.xxxx/xxxxx",
  mailto = "your@email.com"
)

Use Case 2: Extract Specific Sections

# Import multiple papers and extract only Methods
papers <- c("paper1.pdf", "paper2.pdf", "paper3.pdf")

methods_sections <- lapply(papers, function(p) {
  doc <- pdf2txt_auto(p, n_columns = 2)
  doc$Methods
})

names(methods_sections) <- papers

# Analyze methods section lengths
methods_lengths <- sapply(methods_sections, function(m) {
  if (!is.null(m)) {
    length(strsplit(m, "\\s+")[[1]])
  } else {
    NA
  }
})

print(methods_lengths)

Use Case 3: Batch Processing

# Process multiple papers with different layouts
process_paper <- function(file, columns = 2) {
  tryCatch({
    doc <- pdf2txt_auto(file, n_columns = columns)
    list(
      success = TRUE,
      sections = names(doc),
      word_count = length(strsplit(doc$Full_text, "\\s+")[[1]]),
      doc = doc
    )
  }, error = function(e) {
    list(
      success = FALSE,
      error = e$message,
      doc = NULL
    )
  })
}

# Process all PDFs in directory
pdf_files <- list.files(pattern = "\\.pdf$")
results <- lapply(pdf_files, process_paper)
names(results) <- pdf_files

# Check success rate
success_rate <- mean(sapply(results, function(r) r$success))
cat("Successfully processed:", success_rate * 100, "%\n")

Tips and Best Practices

Column Detection

  • Start with n_columns = 2: Most academic papers use two columns
  • If text appears jumbled, try different column values
  • Single-column: theses, reports, preprints
  • Two-column: journal articles, conference papers
  • Three-column: some conference proceedings

Section Detection

  • Works best with clearly formatted section headers
  • Case-insensitive matching
  • Handles common variations (e.g., “METHODS”, “Methods”, “Methodology”)
  • Custom keywords useful for non-standard formats

Limitations

  • Standard extraction requires text-based PDFs; scanned images need the AI-enhanced path (see the sketch after this list)
  • Complex layouts may affect text extraction
  • Tables and figures may disrupt text flow
  • Some formatting is lost in plain text conversion
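
Before importing, you can check whether a PDF actually has a text layer. A minimal sketch using the pdftools package (an assumption: contentanalysis may use a different backend internally, and has_text_layer() is an illustrative helper, not part of the package):

library(pdftools)

# Near-zero extractable characters suggests a scanned PDF
has_text_layer <- function(file, min_chars = 100) {
  txt <- pdf_text(file)  # one string per page
  sum(nchar(trimws(txt))) >= min_chars
}

if (has_text_layer("paper.pdf")) {
  doc <- pdf2txt_auto("paper.pdf", n_columns = 2)
} else {
  doc <- pdf2txt_auto("paper.pdf", use_ai = TRUE)  # scanned: try AI
}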

Troubleshooting

Problem: Garbled or Jumbled Text

Solution: Adjust the n_columns parameter

# Try different column settings
doc_1col <- pdf2txt_auto("paper.pdf", n_columns = 1)
doc_2col <- pdf2txt_auto("paper.pdf", n_columns = 2)
doc_3col <- pdf2txt_auto("paper.pdf", n_columns = 3)

# Compare results
sapply(list(doc_1col, doc_2col, doc_3col), function(d) {
  substr(d$Full_text, 1, 200)
})

Problem: Sections Not Detected

Solution: Check section headers and use custom keywords

# Import without sections to see raw text
doc_raw <- pdf2txt_auto("paper.pdf", sections = FALSE)

# Search for section markers
section_markers <- c("abstract", "introduction", "methods", 
                     "results", "discussion", "conclusion")

positions <- sapply(section_markers, function(marker) {
  gregexpr(marker, doc_raw$Full_text, ignore.case = TRUE)[[1]][1]
})

print(positions[positions > 0])

Problem: Missing References Section

Solution: References may appear under a different name or be merged into another section

# Check all detected sections
names(doc)

# References might be under different names
possible_refs <- c("References", "Bibliography", "Works Cited", "Literature")
ref_section <- intersect(names(doc), possible_refs)

if (length(ref_section) > 0) {
  references <- doc[[ref_section[1]]]  # use the first match if several
} else {
  cat("References section not found automatically\n")
}
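
If none of the common names match, a fallback is to slice Full_text at the last references-style heading (base R only; the regex is a starting point, not exhaustive):

# Cut Full_text at the last "References"/"Bibliography" occurrence
m <- gregexpr("(?i)\\b(references|bibliography)\\b",
              doc$Full_text, perl = TRUE)[[1]]
if (m[1] > 0) {
  references <- substring(doc$Full_text, max(m))
}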

See Also