About

About contentanalysis

contentanalysis is an R package designed to facilitate comprehensive analysis of scientific literature. It provides researchers with powerful tools to extract, analyze, and visualize content from PDF documents, with a particular focus on citation analysis and text mining.

Key Features

PDF Processing

  • AI-enhanced text extraction with Google Gemini 🆕
  • Multi-column layout support (1, 2, or 3 columns)
  • Automatic section detection (Abstract, Introduction, Methods, Results, Discussion, etc.)
  • Structure preservation during text extraction
  • Handles complex academic paper layouts
  • Process large PDFs in chunks 🆕
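
To illustrate the idea behind chunked processing, here is a minimal base R sketch that splits a document's page indices into fixed-size chunks. The helper name chunk_pages() and the chunk size are hypothetical, and the package's own chunking logic may differ.

```r
# Hypothetical helper: split page numbers 1..n_pages into chunks of
# at most chunk_size pages each.
chunk_pages <- function(n_pages, chunk_size) {
  pages <- seq_len(n_pages)
  # ceiling(pages / chunk_size) assigns each page a chunk index.
  unname(split(pages, ceiling(pages / chunk_size)))
}

chunks <- chunk_pages(n_pages = 23, chunk_size = 10)
lengths(chunks)  # number of pages in each chunk

# With pdftools installed, the text of one chunk could then be read as:
# page_text <- pdftools::pdf_text("paper.pdf")[chunks[[1]]]
```

Processing one chunk at a time keeps each request or extraction step small, which is the usual motivation for chunking long documents.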

Citation Analysis

  • Detects multiple citation formats (narrative, parenthetical, and numeric) 🆕
  • Enhanced numeric citation recognition ([1], [1-3], [1,5,7]) 🆕
  • Extracts citation contexts with configurable window sizes
  • Automatic reference matching via CrossRef and OpenAlex 🆕
  • Improved matching algorithms with confidence scoring 🆕
  • Citation network visualization
  • Co-citation analysis
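
As a rough illustration of numeric citation recognition and context extraction, the base R sketch below finds bracketed citations such as [1], [2-4], and [1,5,7], expands ranges into individual reference numbers, and pulls a fixed character window around each match. The regular expression, the helper expand_citation(), and the window size are assumptions made for this example, not the package's actual implementation.

```r
text <- "Prior work [1] extends earlier models [2-4], while others disagree [1,5,7]."

# Brackets containing numbers separated by commas or hyphens.
pattern <- "\\[[0-9]+(\\s*[,-]\\s*[0-9]+)*\\]"
hits <- gregexpr(pattern, text, perl = TRUE)
citations <- regmatches(text, hits)[[1]]

# Hypothetical helper: turn "[2-4]" into 2:4 and "[1,5,7]" into c(1, 5, 7).
expand_citation <- function(citation) {
  inner <- gsub("\\[|\\]|\\s", "", citation)
  parts <- strsplit(inner, ",", fixed = TRUE)[[1]]
  ids <- lapply(parts, function(part) {
    bounds <- as.integer(strsplit(part, "-", fixed = TRUE)[[1]])
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  })
  unlist(ids)
}

reference_ids <- lapply(citations, expand_citation)

# Context: up to `window` characters on each side of every citation.
window <- 20
starts <- as.integer(hits[[1]])
ends <- starts + attr(hits[[1]], "match.length") - 1
contexts <- substring(text, pmax(1, starts - window), pmin(nchar(text), ends + window))
```

A character-based window is the simplest choice; a word- or sentence-based window would need a tokenization step first.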

Text Analytics

  • Word frequency analysis with stopword removal
  • N-gram extraction (bigrams, trigrams, etc.)
  • Word distribution tracking across document sections
  • Readability metrics (Flesch Reading Ease, Flesch-Kincaid, etc.)
  • Lexical diversity measures
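
The base R sketch below shows, in simplified form, what word frequency counting, bigram extraction, and the Flesch Reading Ease score involve. The stopword list is a tiny illustrative sample, and the syllable counter is a naive vowel-group heuristic, so scores will differ from those of a dedicated package; none of the helper names come from contentanalysis.

```r
text <- "Citation analysis maps how papers cite each other. Citation analysis also reveals influential papers."

# Tokenize: keep letters only, lowercase, split on whitespace.
tokenize <- function(x) {
  tokens <- strsplit(tolower(gsub("[^A-Za-z ]", " ", x)), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}
tokens <- tokenize(text)

# Word frequencies after removing a small, illustrative stopword list.
stopwords <- c("how", "each", "other", "also", "the", "a", "of")
word_freq <- sort(table(tokens[!tokens %in% stopwords]), decreasing = TRUE)

# Bigrams: pair each token with its successor (sentence boundaries are ignored here).
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Naive syllable estimate: count groups of consecutive vowels, at least 1.
count_syllables <- function(word) {
  groups <- gregexpr("[aeiouy]+", word)[[1]]
  max(1, sum(groups > 0))
}

# Flesch Reading Ease:
# 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
flesch_reading_ease <- function(x) {
  sentences <- unlist(strsplit(x, "[.!?]+"))
  n_sentences <- sum(nzchar(trimws(sentences)))
  words <- tokenize(x)
  syllables <- sum(vapply(words, count_syllables, numeric(1)))
  206.835 - 1.015 * (length(words) / n_sentences) - 84.6 * (syllables / length(words))
}

score <- flesch_reading_ease(text)
```

Higher Flesch Reading Ease values indicate easier text; dense academic prose typically scores low.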

Visualization

  • Interactive citation networks with visNetwork
  • Word distribution plots (line, bar, area)
  • Customizable color schemes and layouts
  • Publication-ready figures

Use Cases

The package is designed for:

  • Systematic Literature Reviews: Analyze citation patterns and content across multiple papers
  • Bibliometric Studies: Map citation networks and identify influential works
  • Content Analysis Research: Extract and quantify themes, concepts, and terminology
  • Academic Writing Assessment: Evaluate readability and citation practices
  • Research Methodology: Study how citations are deployed in scientific arguments

Development

Author

Massimo Aria, Ph.D.
Department of Economics and Statistics
University of Naples Federico II

Technology Stack

contentanalysis is built using:

  • R: Statistical computing environment
  • pdftools: PDF text extraction
  • Google Gemini API: AI-enhanced text processing 🆕
  • rcrossref: CrossRef API integration
  • OpenAlex API: Bibliographic metadata enrichment 🆕
  • visNetwork: Interactive network visualization
  • quanteda: Text analysis and natural language processing
  • dplyr/tidyr: Data manipulation
  • httr2: HTTP client for AI API communication 🆕

Contributing

We welcome contributions! Here’s how you can help:

Report Bugs

If you find a bug, please open an issue with:

  • A clear description of the problem
  • A reproducible example
  • Your R version and operating system
  • The package version

Suggest Features

Have an idea for improvement? Open an issue describing:

  • The feature you’d like to see
  • Why it would be useful
  • Example use cases

Submit Pull Requests

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project, you agree to abide by its terms.

Citation

If you use contentanalysis in your research, please cite:

@Manual{contentanalysis,
  title = {contentanalysis: Comprehensive Tools for Scientific Content Analysis},
  author = {Massimo Aria},
  year = {2024},
  note = {R package version 0.1.0},
  url = {https://github.com/massimoaria/contentanalysis}
}

License

This package is free and open source software, licensed under GPL-3.

GPL-3 License

Copyright (c) 2024 Massimo Aria

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

Acknowledgments

Special thanks to:

  • The R Core Team for the R language
  • Contributors to all dependent packages
  • The open science community
  • Users who provide feedback and suggestions

Version History

Version 0.1.0 (Current)

Recent updates (October 2025):

  • AI-enhanced PDF extraction: Integration with Google Gemini API for improved text extraction from complex layouts
  • Improved citation matching: Enhanced algorithms for matching citations to references
  • OpenAlex integration: Dual metadata enrichment from CrossRef and OpenAlex
  • Better numeric citation recognition: Improved handling of [1], [1-3], and [1,5,7] formats
  • Enhanced author name matching: Better handling of name variants and abbreviations
  • Large PDF processing: New function to process lengthy documents in chunks

Initial release features:

  • PDF import with automatic section detection
  • Comprehensive citation analysis
  • CrossRef integration for reference matching
  • Interactive network visualization
  • Text analysis and n-gram extraction
  • Readability metrics
  • Word distribution tracking
  • Batch processing capabilities

Frequently Asked Questions

Installation

Q: How do I install the package?

# install.packages("devtools")  # run once if devtools is not installed
devtools::install_github("massimoaria/contentanalysis")

Q: What R version is required?

R >= 4.0.0 is recommended.

Usage

Q: Can I analyze scanned PDFs?

The package primarily works with text-based PDFs. However, with the new AI-enhanced extraction feature (using Google Gemini), you may get better results with scanned or low-quality PDFs. For best results with scanned documents, use OCR processing first.

Q: Do I need a DOI for analysis?

No, but providing a DOI enables automatic reference matching via CrossRef and OpenAlex, which significantly improves citation-reference linking accuracy.

Q: Is an internet connection required?

Only if you use the CrossRef/OpenAlex integration features or AI-enhanced extraction. Local analysis works offline.

Q: Can I analyze multiple papers at once?

Yes! See the Tutorial for batch processing examples.

Technical

Q: Why are some citations not detected?

Citation detection works best with standard formats (APA, MLA, Chicago, Vancouver). The package now includes improved numeric citation recognition. Non-standard or poorly formatted citations may still be missed.

Q: How accurate is the section detection?

Section detection is reliable for standard academic papers with clear section headers. For non-standard formats, you can supply custom keywords.
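
To show how keyword-based section detection can work in principle, here is a minimal base R sketch that treats a line as a header when it consists only of a known keyword, optionally preceded by a section number. The function detect_sections() and its arguments are hypothetical and do not reflect the package's interface.

```r
# Hypothetical helper: group lines under the most recent matching header.
# Keywords are used as regular expressions, so special characters would need escaping.
detect_sections <- function(lines, keywords) {
  pattern <- paste0(
    "^\\s*([0-9]+\\.?\\s*)?(", paste(keywords, collapse = "|"), ")\\s*$"
  )
  is_header <- grepl(pattern, lines, ignore.case = TRUE)

  section <- character(length(lines))
  current <- "Front matter"  # label for text before the first header
  for (i in seq_along(lines)) {
    if (is_header[i]) current <- trimws(lines[i])
    section[i] <- current
  }
  # Return body lines only, grouped by the header they fall under.
  split(lines[!is_header], section[!is_header])
}

lines <- c(
  "Abstract", "We study citations.",
  "1. Introduction", "Citations matter.",
  "Materials and Procedures", "We parsed PDFs."
)
# "Materials and Procedures" stands in for a custom, non-standard header.
keywords <- c("Abstract", "Introduction", "Methods", "Materials and Procedures")
sections <- detect_sections(lines, keywords)
```

Adding a custom keyword is what lets the non-standard "Materials and Procedures" header be recognized in this example.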

Q: Can I customize the network visualization?

Yes, through the max_distance, min_connections, and show_labels parameters.
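
The sketch below illustrates one plausible reading of two of these parameters, assuming that max_distance caps the character distance between two citations for them to be linked and that min_connections is the minimum number of links a citation needs to stay in the network. Both interpretations, the helper build_edges(), and the sample data are assumptions for illustration; check the function documentation for the actual behavior.

```r
# Illustrative data: citation labels and their character positions in the text.
citations <- data.frame(
  id = c("Smith2020", "Lee2019", "Chen2021", "Diaz2018"),
  position = c(100, 450, 900, 5000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: link citations that appear close together in the text.
build_edges <- function(citations, max_distance, min_connections) {
  pairs <- t(combn(seq_len(nrow(citations)), 2))
  edges <- data.frame(
    from = citations$id[pairs[, 1]],
    to = citations$id[pairs[, 2]],
    distance = abs(citations$position[pairs[, 1]] - citations$position[pairs[, 2]]),
    stringsAsFactors = FALSE
  )
  # Keep only pairs that are close enough in the text.
  edges <- edges[edges$distance <= max_distance, ]
  # Single-pass filter: drop nodes with fewer than min_connections links.
  degree <- table(c(edges$from, edges$to))
  keep <- names(degree)[degree >= min_connections]
  edges[edges$from %in% keep & edges$to %in% keep, ]
}

edges <- build_edges(citations, max_distance = 500, min_connections = 1)
```

Under these assumptions, lowering max_distance or raising min_connections yields a sparser network. The from and to columns follow the edge layout that visNetwork expects.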

Q: How do I use the AI-enhanced extraction features? 🆕

Get a free API key from Google AI Studio and set it as an environment variable:

Sys.setenv(GEMINI_API_KEY = "your-api-key-here")

Then use pdf2txt_auto() with use_ai = TRUE or call gemini_content_ai() directly.
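
Before requesting AI-enhanced extraction, it can help to confirm that the key is visible to the current R session. The helper below is a hypothetical convenience written in base R, not a function provided by the package.

```r
# Hypothetical helper: TRUE when GEMINI_API_KEY is set to a non-empty value.
has_gemini_key <- function() {
  nzchar(Sys.getenv("GEMINI_API_KEY"))
}

if (!has_gemini_key()) {
  message("GEMINI_API_KEY is not set; set it before using AI-enhanced extraction.")
}
```

Sys.setenv() only affects the current session. To make the key available in every session, add a line of the form GEMINI_API_KEY=your-api-key-here to your .Renviron file, which R reads at startup.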

Q: Do I need to pay for the Gemini API? 🆕

Google offers a free tier for the Gemini API with generous limits. Check Google AI Studio for current pricing and limits.

Support

For support:

  1. Check the documentation
  2. Read the tutorial
  3. Search existing issues
  4. Open a new issue if needed

Roadmap

Recent additions (October 2025):

  • ✅ AI-enhanced PDF extraction with Google Gemini
  • ✅ OpenAlex metadata integration
  • ✅ Improved numeric citation recognition
  • ✅ Enhanced citation matching algorithms
  • ✅ Large PDF chunk processing

Planned features for future releases:

  • Further enhanced PDF layout detection
  • Additional readability metrics
  • Support for more citation styles
  • Advanced network analysis metrics
  • Integration with additional bibliographic databases
  • Batch analysis reporting tools
  • Machine learning-based classification

Stay tuned for updates!


Last updated: 2024