# Summary statistics
analysis$summary
# Citation extraction
head(analysis$citations)
# Word frequencies
head(analysis$word_frequencies, 20)
# Network statistics
attr(network, "stats")
contentanalysis
Comprehensive Tools for Scientific Content Analysis in R
Overview
contentanalysis is an R package that provides comprehensive tools for extracting and analyzing scientific content from PDF documents. It enables researchers to perform sophisticated content analysis, citation extraction, network visualization, and text mining on academic papers.
The package now supports AI-assisted PDF text extraction through Google's Gemini API, enabling more accurate parsing of complex document layouts. To use this feature, obtain an API key from Google AI Studio.
Key Features
PDF Processing
- AI-enhanced text extraction
- Multi-column PDF support
- Automatic section detection
- Structure preservation
- Complex layout handling
Citation Analysis
- Multiple citation format detection
- Enhanced numeric citation support
- Improved citation matching
- Automatic reference matching
- Citation context extraction
- CrossRef & OpenAlex integration
Network Visualization
- Interactive citation networks
- Co-occurrence analysis
- Section-based coloring
- Hub citation detection
Text Analytics
- Word frequency analysis
- N-gram extraction
- Readability indices
- Distribution tracking
Quick Start
Installation
# Install from GitHub
# install.packages("devtools")
devtools::install_github("massimoaria/contentanalysis")
Basic Usage
library(contentanalysis)
# Import PDF with automatic section detection
doc <- pdf2txt_auto("paper.pdf", n_columns = 2)
# Comprehensive analysis with CrossRef integration
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.1016/j.xxx.xxx",
  mailto = "your@email.com"
)
# Create interactive citation network
network <- create_citation_network(
  analysis,
  max_distance = 800,
  min_connections = 2
)
Example Output
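The package provides rich analytical outputs: summary statistics (analysis$summary), extracted citations (analysis$citations), word frequencies (analysis$word_frequencies), and network statistics (attr(network, "stats")).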
Main Components
1. Content Extraction
Import PDFs and automatically detect document structure:
- pdf2txt_auto(): Import with automatic section detection
- AI-enhanced extraction with gemini_content_ai()
- Supports single, double, and triple-column layouts
- Identifies Abstract, Introduction, Methods, Results, Discussion, etc.
- Handles complex layouts with Google Gemini AI support
2. Citation Analysis
Extract and analyze citations:
- analyze_scientific_content(): Comprehensive content analysis
- Detects narrative, parenthetical, and numeric citations
- Enhanced matching algorithms with confidence scoring
- Matches citations to references with high accuracy
- Extracts citation contexts
- Integrated OpenAlex metadata enrichment
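To make the three citation styles concrete, here is a minimal base R sketch that separates narrative, parenthetical, and numeric citations with regular expressions. It is not the package's implementation: the helper name detect_citations and the patterns are assumptions for illustration, and they cover only simple cases (one author surname, an optional "et al.", four-digit years).

```r
# Illustrative only: simplified patterns, not the ones used by contentanalysis.
# detect_citations() is a hypothetical helper, not a package function.
detect_citations <- function(text) {
  patterns <- c(
    # Narrative: Author (2020) or Author et al. (2020)
    narrative = "[A-Z][a-z]+(?: et al\\.)? \\((?:19|20)[0-9]{2}\\)",
    # Parenthetical: (Author, 2020) or (Author et al., 2020)
    parenthetical = "\\([A-Z][a-z]+(?: et al\\.)?, (?:19|20)[0-9]{2}\\)",
    # Numeric: [1] or [2, 3]
    numeric = "\\[[0-9]+(?:, ?[0-9]+)*\\]"
  )
  # Collect every match of each pattern in the text
  found <- lapply(patterns, function(p) {
    regmatches(text, gregexpr(p, text, perl = TRUE))[[1]]
  })
  data.frame(
    type = rep(names(found), lengths(found)),
    citation = as.character(unlist(found, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

example_text <- paste(
  "Aria et al. (2017) introduced the tool.",
  "Later work confirmed this (Smith, 2020) and extended it [3, 4]."
)
detect_citations(example_text)
```

On the example sentence this returns one row per style. Real papers need far more patterns (multiple authors, page numbers, ranges such as [1-3]), which is why a dedicated function like analyze_scientific_content() is preferable to hand-written expressions.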
3. Network Visualization
Visualize citation relationships:
- create_citation_network(): Interactive network graphs
- Shows co-occurrence patterns
- Color-coded by document sections
- Edge weights based on citation proximity
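The idea of proximity-based edges can be sketched in base R. This is an illustration under stated assumptions, not the package's internals: the citation identifiers and positions are made up, max_distance is assumed to be a character distance, the linear weight decay is a hypothetical choice, and the degree threshold is only one plausible reading of min_connections.

```r
# Illustrative only: hypothetical data and weighting, not the package's internals.
citations <- data.frame(
  id = c("Aria2017", "Smith2020", "Lee2019", "Chen2021"),
  position = c(100, 450, 1000, 5000),  # assumed character offsets in the text
  stringsAsFactors = FALSE
)

# proximity_edges() is a hypothetical helper; it needs at least two citations.
proximity_edges <- function(citations, max_distance = 800) {
  pairs <- t(combn(nrow(citations), 2))  # all unordered pairs of row indices
  distance <- abs(citations$position[pairs[, 1]] -
                  citations$position[pairs[, 2]])
  keep <- distance <= max_distance       # link only citations that are close
  data.frame(
    from = citations$id[pairs[keep, 1]],
    to = citations$id[pairs[keep, 2]],
    distance = distance[keep],
    # Closer pairs get weights nearer to 1 (assumed linear decay)
    weight = 1 - distance[keep] / max_distance,
    stringsAsFactors = FALSE
  )
}

edges <- proximity_edges(citations, max_distance = 800)
edges

# Count connections per citation and keep those with at least two
degree <- table(c(edges$from, edges$to))
names(degree)[degree >= 2]
```

In this toy example only two pairs fall within 800 characters of each other, and Smith2020 is the single citation linked to two neighbours, which is the kind of node a hub detection step would flag.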
4. Text Analytics
Analyze document text:
- calculate_word_distribution(): Track term usage
- calculate_readability_indices(): Readability metrics
- N-gram analysis
- Lexical diversity measures
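For readers unfamiliar with these measures, the following base R sketch shows what word frequencies, bigrams, and a simple lexical diversity score look like on a toy sentence. It does not call the package: the sample text is invented, the tokenizer is deliberately naive, and the type-token ratio is just one common diversity measure, chosen here as an assumption.

```r
# Illustrative only: a base R sketch, not the package's implementation.
text <- "Citation analysis maps citation networks. Citation networks reveal influential works."

# Lowercase, drop everything except letters and spaces, split on whitespace
tokens <- strsplit(gsub("[^a-z ]", "", tolower(text)), " +")[[1]]

# Word frequencies, most frequent first
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 3)

# Bigrams: pair each token with its successor (sentence boundaries are ignored)
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 2)

# Lexical diversity as a type-token ratio: unique words / total words
ttr <- length(unique(tokens)) / length(tokens)
ttr
```

A production pipeline would also remove stop words and respect sentence boundaries when forming n-grams; those steps are omitted to keep the sketch short.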
Use Cases
The package is designed for:
- Systematic Literature Reviews: Analyze citation patterns across multiple papers
- Content Analysis: Extract and quantify key concepts and themes
- Bibliometric Studies: Map citation networks and identify influential works
- Academic Writing Analysis: Assess readability and citation practices
- Research Methodology: Study how citations are deployed in scientific arguments
Getting Help
- Get Started Guide: Step-by-step introduction
- Reference Documentation: Detailed function documentation
- Tutorial: Complete workflow examples
- GitHub Issues: Report bugs or request features
Citation
If you use contentanalysis in your research, please cite:
@Manual{contentanalysis,
  title = {contentanalysis: Scientific Content and Citation Analysis from PDF Documents},
  author = {Massimo Aria},
  year = {2025},
  note = {R package version 0.1.0},
  url = {https://github.com/massimoaria/contentanalysis}
}
License
GPL-3 © 2024