contentanalysis

Comprehensive Tools for Scientific Content Analysis in R

Overview

contentanalysis is an R package that provides comprehensive tools for extracting and analyzing scientific content from PDF documents. It enables researchers to perform sophisticated content analysis, citation extraction, network visualization, and text mining on academic papers.

🆕 New: AI-Enhanced PDF Import

The package now supports AI-assisted PDF text extraction through Google's Gemini API, enabling more accurate parsing of complex document layouts. To use this feature, obtain an API key from Google AI Studio.

Key Features

📄 PDF Processing

  • AI-enhanced text extraction 🆕
  • Multi-column PDF support
  • Automatic section detection
  • Structure preservation
  • Complex layout handling

📚 Citation Analysis

  • Multiple citation format detection
  • Enhanced numeric citation support 🆕
  • Improved citation matching 🆕
  • Automatic reference matching
  • Citation context extraction
  • CrossRef & OpenAlex integration 🆕

🕸️ Network Visualization

  • Interactive citation networks
  • Co-occurrence analysis
  • Section-based coloring
  • Hub citation detection

📊 Text Analytics

  • Word frequency analysis
  • N-gram extraction
  • Readability indices
  • Distribution tracking

Quick Start

Installation

# Install from GitHub
# install.packages("devtools")
devtools::install_github("massimoaria/contentanalysis")

Basic Usage

library(contentanalysis)

# Import PDF with automatic section detection
doc <- pdf2txt_auto("paper.pdf", n_columns = 2)

# Comprehensive analysis with CrossRef integration
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.1016/j.xxx.xxx",
  mailto = "your@email.com"
)

# Create interactive citation network
network <- create_citation_network(
  analysis,
  max_distance = 800,
  min_connections = 2
)

Example Output

The package provides rich analytical outputs:

# Summary statistics
analysis$summary

# Citation extraction
head(analysis$citations)

# Word frequencies
head(analysis$word_frequencies, 20)

# Network statistics
attr(network, "stats")

Main Components

1. Content Extraction

Import PDFs and automatically detect document structure:

  • pdf2txt_auto(): Import with automatic section detection
  • AI-enhanced extraction with gemini_content_ai() 🆕
  • Supports single, double, and triple-column layouts
  • Identifies Abstract, Introduction, Methods, Results, Discussion, etc.
  • Handles complex layouts with Google Gemini AI support

2. Citation Analysis

Extract and analyze citations:

  • analyze_scientific_content(): Comprehensive content analysis
  • Detects narrative, parenthetical, and numeric citations
  • Enhanced matching algorithms with confidence scoring 🆕
  • Matches citations to references with high accuracy
  • Extracts citation contexts
  • Integrated OpenAlex metadata enrichment 🆕
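
To make the three citation styles listed above concrete, here is a minimal base-R sketch that tags narrative, parenthetical, and numeric citations with regular expressions. The patterns, the sample text, and the helper name extract_citations() are illustrative assumptions. They are not the rules or functions used inside contentanalysis, which handles many more formats.

```r
# Illustrative sketch only: these simplified patterns are assumptions,
# not the matching rules implemented in contentanalysis.
text <- paste(
  "Smith and Jones (2020) proposed the method.",
  "Later work confirmed it (Brown et al., 2019; Lee, 2021).",
  "Numeric styles cite sources like this [3] or this [4, 5]."
)

# A four-digit year starting with 19 or 20, optionally suffixed (e.g. 2020a)
year <- "(?:19|20)[0-9]{2}[a-z]?"

patterns <- c(
  # Author (Year), Author and Author (Year), Author et al. (Year)
  narrative = paste0(
    "[A-Z][a-z]+(?: (?:and|&) [A-Z][a-z]+| et al\\.)? \\(", year, "\\)"
  ),
  # (Author, Year) or (Author, Year; Author, Year; ...)
  parenthetical = paste0(
    "\\([A-Z][^()]*?, ", year, "(?:; [^()]*?, ", year, ")*\\)"
  ),
  # [3], [4, 5], [6-8]
  numeric = "\\[[0-9]+(?:\\s*[,-]\\s*[0-9]+)*\\]"
)

# Hypothetical helper: returns one row per detected citation
extract_citations <- function(text, patterns) {
  rows <- lapply(names(patterns), function(type) {
    m <- gregexpr(patterns[[type]], text, perl = TRUE)
    hits <- regmatches(text, m)[[1]]
    if (length(hits) == 0) return(NULL)
    data.frame(
      type = type,
      citation = hits,
      start = as.integer(m[[1]]),  # character offset of each match
      stringsAsFactors = FALSE
    )
  })
  rows <- Filter(Negate(is.null), rows)
  if (length(rows) == 0) {
    return(data.frame(type = character(), citation = character(),
                      start = integer(), stringsAsFactors = FALSE))
  }
  out <- do.call(rbind, rows)
  out <- out[order(out$start), ]   # report citations in reading order
  rownames(out) <- NULL
  out
}

cites <- extract_citations(text, patterns)
print(cites)
```

The start column records where each citation begins in the text, which is the kind of positional information that context extraction and proximity-based networks can build on.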

3. Network Visualization

Visualize citation relationships:

  • create_citation_network(): Interactive network graphs
  • Shows co-occurrence patterns
  • Color-coded by document sections
  • Edge weights based on citation proximity
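
The idea of proximity-based edges can be sketched in base R as follows. This is a simplified illustration, not the algorithm behind create_citation_network(). The input data frame, the helper build_edges(), the assumption that distance is measured in characters, and the linear weighting scheme are all hypothetical.

```r
# Hypothetical input: citation labels with their character offsets in the text
cites <- data.frame(
  citation = c("Smith (2020)", "Lee (2021)", "Brown (2019)", "Chen (2018)"),
  position = c(100, 450, 700, 2500),
  stringsAsFactors = FALSE
)

# Hypothetical helper: link every pair of citations that appear within
# max_distance characters of each other
build_edges <- function(cites, max_distance = 800) {
  pairs <- t(utils::combn(nrow(cites), 2))   # all unordered pairs of rows
  dist <- abs(cites$position[pairs[, 1]] - cites$position[pairs[, 2]])
  keep <- dist <= max_distance
  data.frame(
    from = cites$citation[pairs[keep, 1]],
    to = cites$citation[pairs[keep, 2]],
    distance = dist[keep],
    # Assumed weighting: closer citations get heavier edges (range 0 to 1)
    weight = 1 - dist[keep] / max_distance,
    stringsAsFactors = FALSE
  )
}

edges <- build_edges(cites, max_distance = 800)
print(edges)
```

In this toy example the first three citations form a connected cluster, while "Chen (2018)" sits too far from the others to receive any edge.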

4. Text Analytics

Analyze document text:

  • calculate_word_distribution(): Track term usage
  • calculate_readability_indices(): Readability metrics
  • N-gram analysis
  • Lexical diversity measures
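
For readers unfamiliar with these measures, the base-R sketch below computes word frequencies, bigrams, a type-token ratio, and the Automated Readability Index (ARI) on a toy string. It is only meant to show what the metrics are. The tokenization is deliberately naive, ARI is used as one well-known example index, and none of this reflects how calculate_word_distribution() or calculate_readability_indices() are implemented.

```r
# Toy input; real documents need more careful tokenization
text <- "Citation analysis maps how papers cite papers. Citation networks reveal influential papers."

# Naive tokenization: lowercase, split on anything that is not a letter
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Word frequencies, most frequent first
word_freq <- sort(table(tokens), decreasing = TRUE)

# Bigrams: each token paired with its successor
# (this simple version also pairs tokens across sentence boundaries)
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Lexical diversity as a type-token ratio: unique words / total words
ttr <- length(unique(tokens)) / length(tokens)

# Automated Readability Index:
# 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
n_chars <- sum(nchar(tokens))
n_words <- length(tokens)
n_sentences <- lengths(regmatches(text, gregexpr("[.!?]+", text)))
ari <- 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sentences) - 21.43

print(head(word_freq, 3))
print(round(c(ttr = ttr, ari = ari), 3))
```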

Use Cases

The package is designed for:

  • Systematic Literature Reviews: Analyze citation patterns across multiple papers
  • Content Analysis: Extract and quantify key concepts and themes
  • Bibliometric Studies: Map citation networks and identify influential works
  • Academic Writing Analysis: Assess readability and citation practices
  • Research Methodology: Study how citations are deployed in scientific arguments

Getting Help

Citation

If you use contentanalysis in your research, please cite:

@Manual{contentanalysis,
  title = {contentanalysis: Scientific Content and Citation Analysis from PDF Documents},
  author = {Massimo Aria},
  year = {2025},
  note = {R package version 0.1.0},
  url = {https://github.com/massimoaria/contentanalysis}
}

License

GPL-3 © 2024