About
About contentanalysis
contentanalysis is an R package designed to facilitate comprehensive analysis of scientific literature. It provides researchers with powerful tools to extract, analyze, and visualize content from PDF documents, with a particular focus on citation analysis and text mining.
Key Features
PDF Processing
- AI-enhanced text extraction with Google Gemini
- Multi-column layout support (1, 2, or 3 columns)
- Automatic section detection (Abstract, Introduction, Methods, Results, Discussion, etc.)
- Structure preservation during text extraction
- Handles complex academic paper layouts
- Process large PDFs in chunks
Citation Analysis
- Detects multiple citation formats (narrative, parenthetical, and numeric)
- Enhanced numeric citation recognition ([1], [1-3], [1,5,7])
- Extracts citation contexts with configurable window sizes
- Automatic reference matching via CrossRef and OpenAlex
- Improved matching algorithms with confidence scoring
- Citation network visualization
- Co-citation analysis
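To make the numeric formats listed above concrete, here is a minimal base R sketch that finds bracketed citations such as [1], [1-3], and [1,5,7] and expands each into the reference numbers it covers. It is an illustration only, not the package's internal implementation: the function names expand_numeric_citation() and find_numeric_citations() are hypothetical, and the pattern assumes plain hyphens rather than en dashes in ranges.

```r
# Illustrative sketch only: NOT the contentanalysis implementation.
# Function names are hypothetical; ranges are assumed to use a plain hyphen.

# Expand one bracketed citation, e.g. "[1,3-5]" -> 1 3 4 5
expand_numeric_citation <- function(citation) {
  # Drop the brackets and any whitespace
  inner <- gsub("\\[|\\]|\\s", "", citation)
  parts <- strsplit(inner, ",", fixed = TRUE)[[1]]
  nums <- lapply(parts, function(p) {
    bounds <- as.integer(strsplit(p, "-", fixed = TRUE)[[1]])
    # Two bounds mean a range; one bound is a single reference number
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  })
  sort(unique(unlist(nums)))
}

# Find every bracketed numeric citation in a text and expand it
find_numeric_citations <- function(text) {
  pattern <- "\\[\\d+(\\s*[-,]\\s*\\d+)*\\]"
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  lapply(stats::setNames(hits, hits), expand_numeric_citation)
}

text <- "Earlier work [1] was extended in [2-4] and later surveyed [1,5,7]."
find_numeric_citations(text)
```

The result is a named list with one entry per citation marker found in the text, which is a convenient shape for later matching against a numbered reference list.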
Text Analytics
- Word frequency analysis with stopword removal
- N-gram extraction (bigrams, trigrams, etc.)
- Word distribution tracking across document sections
- Readability metrics (Flesch Reading Ease, Flesch-Kincaid, etc.)
- Lexical diversity measures
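As a rough illustration of what word frequency analysis with stopword removal and n-gram extraction involve, the following base R sketch tokenizes a short text, counts words, and builds bigrams. It does not use the package's own functions: tokenize(), word_frequencies(), and ngrams() are hypothetical helpers, and stopwords_sample is a deliberately tiny example list rather than a real stopword dictionary.

```r
# Illustrative sketch only: base R, not the contentanalysis API.
# All names below are hypothetical; the stopword list is a tiny sample.
stopwords_sample <- c("the", "of", "and", "a", "in", "to", "is", "are")

# Lowercase the text and split it on anything that is not a letter,
# digit, or apostrophe
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  tokens[nzchar(tokens)]
}

# Count the remaining words after dropping stopwords, most frequent first
word_frequencies <- function(tokens, stopwords = stopwords_sample) {
  kept <- tokens[!tokens %in% stopwords]
  sort(table(kept), decreasing = TRUE)
}

# Build n-grams by sliding a window of n tokens over the sequence
ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- tokenize("Citation networks reveal structure. Citation networks are sparse.")
word_frequencies(tokens)
table(ngrams(tokens, n = 2))
```

In this sketch the n-grams are built from all tokens, stopwords included. Whether to filter stopwords before or after forming n-grams is a design choice that changes which phrases surface.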
Visualization
- Interactive citation networks with visNetwork
- Word distribution plots (line, bar, area)
- Customizable color schemes and layouts
- Publication-ready figures
Use Cases
The package is designed for:
- Systematic Literature Reviews: Analyze citation patterns and content across multiple papers
- Bibliometric Studies: Map citation networks and identify influential works
- Content Analysis Research: Extract and quantify themes, concepts, and terminology
- Academic Writing Assessment: Evaluate readability and citation practices
- Research Methodology: Study how citations are deployed in scientific arguments
Development
Technology Stack
contentanalysis is built using:
- R: Statistical computing environment
- pdftools: PDF text extraction
- Google Gemini API: AI-enhanced text processing
- rcrossref: CrossRef API integration
- OpenAlex API: Bibliographic metadata enrichment
- visNetwork: Interactive network visualization
- quanteda: Text analysis and natural language processing
- dplyr/tidyr: Data manipulation
- httr2: HTTP client for AI API communication
Contributing
We welcome contributions! Here's how you can help:
Report Bugs
If you find a bug, please open an issue with:
- A clear description of the problem
- A reproducible example
- Your R version and operating system
- The package version
Suggest Features
Have an idea for improvement? Open an issue describing:
- The feature you'd like to see
- Why it would be useful
- Example use cases
Submit Pull Requests
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Code of Conduct
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
Citation
If you use contentanalysis in your research, please cite:
@Manual{contentanalysis,
title = {contentanalysis: Comprehensive Tools for Scientific Content Analysis},
author = {Massimo Aria},
year = {2024},
note = {R package version 0.1.0},
url = {https://github.com/massimoaria/contentanalysis}
}
License
This package is free and open source software, licensed under GPL-3.
GPL-3 License
Copyright (c) 2024 Massimo Aria
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
Acknowledgments
Special thanks to:
- The R Core Team for the R language
- Contributors to all dependent packages
- The open science community
- Users who provide feedback and suggestions
Contact
Version History
Version 0.1.0 (Current)
Recent updates (October 2025):
- AI-enhanced PDF extraction: Integration with Google Gemini API for improved text extraction from complex layouts
- Improved citation matching: Enhanced algorithms for matching citations to references
- OpenAlex integration: Dual metadata enrichment from CrossRef and OpenAlex
- Better numeric citation recognition: Improved handling of [1], [1-3], and [1,5,7] formats
- Enhanced author name matching: Better handling of name variants and abbreviations
- Large PDF processing: New function to process lengthy documents in chunks
Initial release features:
- PDF import with automatic section detection
- Comprehensive citation analysis
- CrossRef integration for reference matching
- Interactive network visualization
- Text analysis and n-gram extraction
- Readability metrics
- Word distribution tracking
- Batch processing capabilities
Frequently Asked Questions
Installation
Q: How do I install the package?
devtools::install_github("massimoaria/contentanalysis")
Q: What R version is required?
R >= 4.0.0 is recommended.
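The two answers above can be combined into one setup snippet. This is a minimal sketch, assuming the devtools route shown above: the helpers meets_minimum_r() and install_contentanalysis() are hypothetical names introduced here, and the install call is left commented out so that pasting the snippet does not trigger a download unintentionally.

```r
# Minimal setup sketch; helper names are hypothetical, not part of the package.

# TRUE when the running R version is at least the recommended minimum
meets_minimum_r <- function(minimum = "4.0.0") {
  getRversion() >= minimum
}

# Install devtools only if it is missing, then install from GitHub
install_contentanalysis <- function() {
  if (!meets_minimum_r()) {
    warning("R >= 4.0.0 is recommended for contentanalysis.")
  }
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("massimoaria/contentanalysis")
}

# Uncomment to run the installation (requires an internet connection):
# install_contentanalysis()
```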
Usage
Q: Can I analyze scanned PDFs?
The package primarily works with text-based PDFs. However, with the new AI-enhanced extraction feature (using Google Gemini), you may get better results with scanned or low-quality PDFs. For best results with scanned documents, use OCR processing first.
Q: Do I need a DOI for analysis?
No, but providing a DOI enables automatic reference matching via CrossRef and OpenAlex, which significantly improves citation-reference linking accuracy.
Q: Is an internet connection required?
Only if you use the CrossRef/OpenAlex integration features or AI-enhanced extraction. Local analysis works offline.
Q: Can I analyze multiple papers at once?
Yes! See the Tutorial for batch processing examples.
Technical
Q: Why are some citations not detected?
Citation detection works best with standard formats (APA, MLA, Chicago, Vancouver). The package now includes improved numeric citation recognition. Non-standard or poorly formatted citations may still be missed.
Q: How accurate is the section detection?
Section detection is reliable for standard academic papers with clear section headers. For non-standard formats, you can supply custom keywords.
Q: Can I customize the network visualization?
Yes, through the max_distance, min_connections, and show_labels parameters.
Q: How do I use the AI-enhanced extraction features?
Get a free API key from Google AI Studio and set it as an environment variable:
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")
Then use pdf2txt_auto() with use_ai = TRUE or call gemini_content_ai() directly.
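Before requesting AI-enhanced extraction, it can help to confirm that the key is actually visible to the current R session. The sketch below uses only base R. has_gemini_key() is a hypothetical helper, not a package function, and how the package itself behaves when the key is missing is not covered here.

```r
# Hypothetical helper (not part of contentanalysis):
# TRUE when a non-empty GEMINI_API_KEY is visible to this R session.
has_gemini_key <- function() {
  nzchar(Sys.getenv("GEMINI_API_KEY"))
}

if (!has_gemini_key()) {
  message("GEMINI_API_KEY is not set; set it before using use_ai = TRUE.")
}
```

Sys.setenv() only affects the current session. To make the key available in every session, add a line of the form GEMINI_API_KEY=your-api-key-here to your ~/.Renviron file, which R reads at startup.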
Q: Do I need to pay for the Gemini API?
Google offers a free tier for the Gemini API with generous limits. Check Google AI Studio for current pricing and limits.
Support
For support:
- Check the documentation
- Read the tutorial
- Search existing issues
- Open a new issue if needed
Roadmap
Recent additions (October 2025):
- AI-enhanced PDF extraction with Google Gemini
- OpenAlex metadata integration
- Improved numeric citation recognition
- Enhanced citation matching algorithms
- Large PDF chunk processing
Planned features for future releases:
- Further enhanced PDF layout detection
- Additional readability metrics
- Support for more citation styles
- Advanced network analysis metrics
- Integration with additional bibliographic databases
- Batch analysis reporting tools
- Machine learning-based classification
Stay tuned for updates!
Last updated: 2024