PDF Import Functions
Overview
The PDF import functions extract text from PDF documents while preserving document structure. The package provides automatic section detection, supports various column layouts, and now includes AI-enhanced extraction through Google’s Gemini API. This feature improves accuracy with complex document layouts and can better parse:
- Multi-column layouts with complex flow
- Documents with tables and figures
- Papers with unusual formatting
- Scanned or low-quality PDFs
To use AI features, you need a Gemini API key from Google AI Studio.
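The key can be passed per call via api_key or read from the GEMINI_API_KEY environment variable. A minimal setup sketch (the key value is a placeholder):
# Set the key for the current session
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")
# Or persist it across sessions by adding this line to your ~/.Renviron:
# GEMINI_API_KEY=your-api-key-here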
Main Function
pdf2txt_auto()
Import a PDF with automatic section detection and multi-column support. Now includes optional AI enhancement.
Usage
pdf2txt_auto(
file_path,
n_columns = 1,
sections = TRUE,
section_keywords = NULL,
use_ai = FALSE, # New: Enable AI processing
ai_model = "2.0-flash" # New: Gemini model version
)
Arguments
file_path: Character string. Path to the PDF file
n_columns: Integer. Number of columns in the PDF layout (1, 2, or 3)
sections: Logical. Whether to split text into sections
section_keywords: Named list of character vectors. Custom keywords for section detection (see the Custom Section Keywords example)
use_ai: Logical. Whether to use AI-enhanced extraction (default: FALSE) 🆕
ai_model: Character. Gemini model version to use (default: “2.0-flash”) 🆕
Value
A named list containing:
- Full_text: The complete document text
- Section-specific text (e.g., Abstract, Introduction, Methods, etc.)
New AI Functions 🆕
gemini_content_ai()
Process PDFs and images with Google Gemini AI for enhanced content extraction.
Usage
gemini_content_ai(
image = NULL,
docs = NULL,
prompt = "Explain these images",
model = "2.0-flash",
image_type = "png",
retry_503 = 5,
api_key = NULL,
outputSize = "medium"
)
Arguments
image: Character vector. Path(s) to image file(s)
docs: Character vector. Path(s) to document file(s) (PDF, TXT, HTML, CSV, RTF)
prompt: Character. The prompt/instruction for the AI model
model: Character. Gemini model version (“1.5-flash”, “2.0-flash”, “2.5-flash”)
image_type: Character. Image MIME type (default: “png”)
retry_503: Integer. Retry attempts for HTTP 503 errors (default: 5)
api_key: Character. Gemini API key (uses GEMINI_API_KEY env var if NULL)
outputSize: Character. Output token limit: “small” (8K), “medium” (16K), “large” (32K), “huge” (131K)
Value
Character vector containing the AI-generated response(s).
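Beyond PDFs, the same function accepts image paths via the image argument. A minimal sketch (figure1.png is a hypothetical file name):
# Describe a standalone figure image (hypothetical file)
fig_summary <- gemini_content_ai(
  image = "figure1.png",
  prompt = "Describe this figure in two sentences",
  outputSize = "small"
)
cat(fig_summary)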
process_large_pdf() 🆕
Process large PDFs in chunks using AI for better handling of lengthy documents.
Usage
process_large_pdf(
file_path,
chunk_pages = 10,
ai_model = "2.0-flash",
prompt = "Extract and structure the text from this document",
api_key = NULL
)
Arguments
file_path: Character string. Path to the PDF file
chunk_pages: Integer. Number of pages to process per chunk (default: 10)
ai_model: Character. Gemini model version to use
prompt: Character. Instruction for the AI model
api_key: Character. Gemini API key
Value
A named list with text extracted from all chunks, merged together.
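For example, with the default chunk_pages = 10, a 45-page PDF is split into five chunks (pages 1-10, 11-20, 21-30, 31-40, and 41-45); the text extracted from each chunk is then merged into the returned list.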
Examples
Basic Import
Import a simple single-column PDF:
library(contentanalysis)
# Single column PDF
doc <- pdf2txt_auto("paper.pdf", n_columns = 1)
# Check structure
names(doc)
str(doc, max.level = 1)
Multi-Column PDFs
Most academic papers use two-column layouts:
# Two-column layout (most common)
doc <- pdf2txt_auto("paper.pdf", n_columns = 2)
# Three-column layout (less common)
doc_three <- pdf2txt_auto("paper.pdf", n_columns = 3)
Section Detection
The function automatically detects common academic sections:
# With automatic section detection (default)
doc <- pdf2txt_auto("paper.pdf", n_columns = 2, sections = TRUE)
# View detected sections
names(doc)
# [1] "Full_text" "Abstract" "Introduction" "Methods"
# [5] "Results" "Discussion" "References"
# Access specific sections
cat(doc$Abstract)
cat(doc$Introduction)
Without Section Splitting
Extract text without section detection:
# Get only full text
text_only <- pdf2txt_auto("paper.pdf", sections = FALSE)
# Result is a list with only Full_text
names(text_only)
# [1] "Full_text"AI-Enhanced Import 🆕
Use AI for complex or difficult PDFs:
# Set API key (if not in .Renviron)
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")
# Basic AI-enhanced extraction
doc_ai <- pdf2txt_auto(
"complex_paper.pdf",
n_columns = 2,
use_ai = TRUE,
ai_model = "2.0-flash"
)
# Process a large PDF in chunks
large_doc <- process_large_pdf(
"long_paper.pdf",
chunk_pages = 10,
ai_model = "2.0-flash"
)
# Direct AI extraction with custom prompt
result <- gemini_content_ai(
docs = "paper.pdf",
prompt = "Extract all section headings and their content from this paper",
outputSize = "large"
)
# Process multiple documents
results <- gemini_content_ai(
docs = c("paper1.pdf", "paper2.pdf"),
prompt = "Summarize the main findings",
model = "2.0-flash"
)
Custom Section Keywords
Define custom keywords for section detection:
# Custom section keywords
my_keywords <- list(
Background = c("background", "literature review"),
Methodology = c("methodology", "experimental design"),
Findings = c("findings", "observations"),
Conclusions = c("conclusions", "summary")
)
doc <- pdf2txt_auto(
"paper.pdf",
n_columns = 2,
sections = TRUE,
section_keywords = my_keywords
)
names(doc)
Working with Results
Accessing Section Content
# Get word count per section
section_lengths <- sapply(doc[names(doc) != "Full_text"], function(x) {
length(strsplit(x, "\\s+")[[1]])
})
print(section_lengths)
# Example output:
# Abstract Introduction Methods Results Discussion References
# 234 1052 876 1342 945 523
Checking Section Quality
# Check if important sections were detected
required_sections <- c("Abstract", "Introduction", "Methods", "Results")
detected <- required_sections %in% names(doc)
names(detected) <- required_sections
print(detected)
# Identify missing sections
missing <- required_sections[!detected]
if (length(missing) > 0) {
cat("Warning: Missing sections:", paste(missing, collapse = ", "), "\n")
}
Preview Section Content
# Preview first 500 characters of each section
preview_sections <- function(doc, n_chars = 500) {
sections <- names(doc)[names(doc) != "Full_text"]
for (section in sections) {
cat("\n===", section, "===\n")
cat(substr(doc[[section]], 1, n_chars), "...\n")
}
}
preview_sections(doc)
Common Use Cases
Use Case 1: Prepare for Analysis
# Import PDF
doc <- pdf2txt_auto("example_paper.pdf", n_columns = 2)
# Verify sections were detected
if (!"Introduction" %in% names(doc)) {
warning("Introduction not detected - check section keywords")
}
# Pass to analysis
analysis <- analyze_scientific_content(
text = doc,
doi = "10.xxxx/xxxxx",
mailto = "your@email.com"
)
Use Case 2: Extract Specific Sections
# Import multiple papers and extract only Methods
papers <- c("paper1.pdf", "paper2.pdf", "paper3.pdf")
methods_sections <- lapply(papers, function(p) {
doc <- pdf2txt_auto(p, n_columns = 2)
doc$Methods
})
names(methods_sections) <- papers
# Analyze methods section lengths
methods_lengths <- sapply(methods_sections, function(m) {
if (!is.null(m)) {
length(strsplit(m, "\\s+")[[1]])
} else {
NA
}
})
print(methods_lengths)
Use Case 3: Batch Processing
# Process multiple papers with different layouts
process_paper <- function(file, columns = 2) {
tryCatch({
doc <- pdf2txt_auto(file, n_columns = columns)
list(
success = TRUE,
sections = names(doc),
word_count = length(strsplit(doc$Full_text, "\\s+")[[1]]),
doc = doc
)
}, error = function(e) {
list(
success = FALSE,
error = e$message,
doc = NULL
)
})
}
# Process all PDFs in directory
pdf_files <- list.files(pattern = "\\.pdf$")
results <- lapply(pdf_files, process_paper)
names(results) <- pdf_files
# Check success rate
success_rate <- mean(sapply(results, function(r) r$success))
cat("Successfully processed:", success_rate * 100, "%\n")Tips and Best Practices
Column layout:
- Start with n_columns = 2: most academic papers use two columns
- If text appears jumbled, try different column values
- Single-column: theses, reports, preprints
- Two-column: journal articles, conference papers
- Three-column: some conference proceedings
Section detection:
- Works best with clearly formatted section headers
- Matching is case-insensitive
- Handles common variations (e.g., “METHODS”, “Methods”, “Methodology”)
- Custom keywords are useful for non-standard formats
Limitations:
- Standard extraction requires text-based PDFs (not scanned images); AI-enhanced extraction can help with scans (see the check sketched below)
- Complex layouts may affect text extraction
- Tables and figures may disrupt text flow
- Some formatting is lost in plain-text conversion
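To check whether a PDF is text-based before importing, one quick heuristic is to count how many characters standard extraction recovers per page. A minimal sketch, assuming the pdftools package is installed (the 100-character threshold is an arbitrary cutoff):
library(pdftools)
# Scanned PDFs typically yield little or no extractable text
pages <- pdf_text("paper.pdf")
chars_per_page <- nchar(pages)
if (mean(chars_per_page) < 100) {
  message("Few extractable characters - PDF may be scanned; consider use_ai = TRUE")
}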
Troubleshooting
Problem: Garbled or Jumbled Text
Solution: Adjust the n_columns parameter
# Try different column settings
doc_1col <- pdf2txt_auto("paper.pdf", n_columns = 1)
doc_2col <- pdf2txt_auto("paper.pdf", n_columns = 2)
doc_3col <- pdf2txt_auto("paper.pdf", n_columns = 3)
# Compare results
sapply(list(doc_1col, doc_2col, doc_3col), function(d) {
substr(d$Full_text, 1, 200)
})
Problem: Sections Not Detected
Solution: Check section headers and use custom keywords
# Import without sections to see raw text
doc_raw <- pdf2txt_auto("paper.pdf", sections = FALSE)
# Search for section markers
section_markers <- c("abstract", "introduction", "methods",
"results", "discussion", "conclusion")
positions <- sapply(section_markers, function(marker) {
gregexpr(marker, doc_raw$Full_text, ignore.case = TRUE)[[1]][1]
})
print(positions[positions > 0])
Problem: Missing References Section
Solution: References may appear under a different section name
# Check all detected sections
names(doc)
# References might be under different names
possible_refs <- c("References", "Bibliography", "Works Cited", "Literature")
ref_section <- intersect(names(doc), possible_refs)
if (length(ref_section) > 0) {
references <- doc[[ref_section[1]]]
} else {
cat("References section not found automatically\n")
}
See Also
- Content Analysis: Analyze imported documents
- Citation Analysis: Extract citations from text
- Tutorial: Complete workflow examples