Content Analysis
Overview
The analyze_scientific_content() function performs comprehensive analysis of scientific documents, including citation extraction, reference matching, text analytics, and network analysis.
Main Function
analyze_scientific_content()
Performs comprehensive content analysis with optional CrossRef integration.
Usage
analyze_scientific_content(
  text,
  doi = NULL,
  mailto = NULL,
  window_size = 10,
  remove_stopwords = TRUE,
  custom_stopwords = NULL,
  ngram_range = c(1, 3),
  use_sections_for_citations = TRUE
)
Arguments
- text: Named list from pdf2txt_auto() containing document text
- doi: Character string. DOI of the document (optional but recommended)
- mailto: Character string. Email for CrossRef API access (required if using a DOI)
- window_size: Integer. Number of words before/after each citation to extract
- remove_stopwords: Logical. Remove common English stopwords
- custom_stopwords: Character vector. Additional stopwords to remove
- ngram_range: Numeric vector of length 2. Range of n-grams to extract (e.g., c(1, 3) for unigrams to trigrams)
- use_sections_for_citations: Logical. Include section information in citation analysis
Value
A list containing:
- text_analytics: Summary statistics about the text
- citations: Data frame of extracted citations
- citation_contexts: Citation contexts with surrounding text
- citation_metrics: Citation statistics by section
- citation_references_mapping: Matched citations and references
- parsed_references: Parsed reference entries
- word_frequencies: Word frequency table
- ngrams: N-gram frequency lists
- network_data: Citation co-occurrence data
- section_colors: Color mapping for sections
- summary: Overall summary statistics
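To inspect the top-level structure of the returned object, base R's str() works well (assuming an analysis object like the ones created below):
# Show the names and types of the top-level components
str(analysis, max.level = 1)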
Basic Usage
Simple Analysis
Analyze without CrossRef integration:
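library(contentanalysis)
# Import document
doc <- pdf2txt_auto("paper.pdf", n_columns = 2)
# Basic analysis
analysis <- analyze_scientific_content(
  text = doc,
  window_size = 10,
  remove_stopwords = TRUE
)
# View summary
print(analysis$summary)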
With CrossRef Integration
Enhanced analysis with automatic reference matching:
# Analysis with DOI (recommended)
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.1016/j.mlwa.2021.100094",
  mailto = "your@email.com",
  window_size = 10,
  remove_stopwords = TRUE,
  ngram_range = c(1, 3)
)
# Check reference matching quality
matching_rate <- analysis$summary$references_matched /
  analysis$summary$citations_extracted
cat("Matching rate:", round(matching_rate * 100, 1), "%\n")
Understanding Results
Summary Statistics
The summary provides key metrics:
analysis$summary
# Example output:
# $total_words: 5234
# $citations_extracted: 42
# $narrative_citations: 18
# $parenthetical_citations: 24
# $references_matched: 38
# $lexical_diversity: 0.421
# $citation_density: 8.03 (citations per 1000 words)
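Since citation_density is expressed as citations per 1,000 words, it can be approximated from the other summary fields (the word count used internally may differ slightly from total_words):
# Approximate citation density from the summary fields
with(analysis$summary, citations_extracted / total_words * 1000)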
Text Analytics
Basic text statistics:
analysis$text_analytics
# Includes:
# - Total words
# - Unique words
# - Lexical diversity
# - Average word length
# - Sentence count (if available)
Citation Extraction
View extracted citations:
library(dplyr)
# All citations
head(analysis$citations)
# Citation types
table(analysis$citations$citation_type)
# narrative parenthetical
# 18 24
# Citations by section
analysis$citation_metrics$section_distribution
Citation Contexts
Access text surrounding citations:
# View citation contexts
contexts <- analysis$citation_contexts %>%
  select(citation_text_clean, section, words_before, words_after)
head(contexts, 3)
# Example:
# citation_text_clean    section       words_before              words_after
# "Breiman (2001)"       Introduction  "as shown by"             "the method provides"
# "(Smith & Jones 2020)" Methods       "following the approach"  "we implemented the"
Advanced Usage
Custom Stopwords
Add domain-specific stopwords:
# Define custom stopwords
custom_stops <- c("however", "therefore", "thus", "moreover",
                  "furthermore", "additionally", "specifically")
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.xxxx/xxxxx",
  mailto = "your@email.com",
  custom_stopwords = custom_stops,
  remove_stopwords = TRUE
)
# Compare with default
top_words_custom <- head(analysis$word_frequencies$word, 20)
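To see what the custom list changes, compare against a default run; analysis_default below is a hypothetical second call made without custom_stopwords:
# Words in the default top 20 that drop out after adding custom stopwords
# (analysis_default is a hypothetical run without custom_stopwords)
setdiff(head(analysis_default$word_frequencies$word, 20), top_words_custom)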
Adjusting Citation Window
Control context extraction:
# Narrow window for focused context
analysis_narrow <- analyze_scientific_content(
  text = doc,
  window_size = 5,  # 5 words before/after
  doi = "10.xxxx/xxxxx",
  mailto = "your@email.com"
)
# Wide window for broader context
analysis_wide <- analyze_scientific_content(
  text = doc,
  window_size = 20,  # 20 words before/after
  doi = "10.xxxx/xxxxx",
  mailto = "your@email.com"
)
# Compare context lengths
mean(nchar(analysis_narrow$citation_contexts$words_before))
mean(nchar(analysis_wide$citation_contexts$words_before))
N-gram Configuration
Extract different n-gram ranges:
# Only unigrams and bigrams
analysis_12 <- analyze_scientific_content(
  text = doc,
  ngram_range = c(1, 2)
)
# Up to 4-grams
analysis_14 <- analyze_scientific_content(
  text = doc,
  ngram_range = c(1, 4)
)
# Only bigrams and trigrams
analysis_23 <- analyze_scientific_content(
  text = doc,
  ngram_range = c(2, 3)
)
# View results
head(analysis_14$ngrams$`4gram`, 10)
Working with Results
Citation Analysis Workflow
# 1. Find citations to a specific author
breiman_cites <- analysis$citation_references_mapping %>%
  filter(grepl("Breiman", ref_authors, ignore.case = TRUE))
nrow(breiman_cites)
# 2. Citations in Introduction
intro_cites <- analysis$citations %>%
  filter(section == "Introduction")
# 3. Most cited references
cite_counts <- analysis$citation_references_mapping %>%
  count(ref_full_text, sort = TRUE)
head(cite_counts, 10)
# 4. Narrative vs parenthetical by section
citation_types_by_section <- analysis$citations %>%
  group_by(section, citation_type) %>%
  summarise(count = n(), .groups = "drop")
print(citation_types_by_section)
Text Analysis Workflow
# 1. Most frequent words
top_20_words <- head(analysis$word_frequencies, 20)
# 2. Domain-specific terms (e.g., methods)
method_terms <- analysis$word_frequencies %>%
  filter(grepl("model|algorithm|method|approach", word))
head(method_terms, 10)
# 3. Most common bigrams
top_bigrams <- head(analysis$ngrams$`2gram`, 15)
# 4. Technical trigrams
tech_trigrams <- analysis$ngrams$`3gram` %>%
  filter(frequency > 2)
head(tech_trigrams)
Reference Matching Quality
Assess matching quality:
# View matching diagnostics
print_matching_diagnostics(analysis)
# Confidence distribution
table(analysis$citation_references_mapping$match_confidence)
# High confidence matches
high_conf <- analysis$citation_references_mapping %>%
  filter(match_confidence == "high")
cat("High confidence matches:", nrow(high_conf), "\n")
# Low confidence matches (may need review)
low_conf <- analysis$citation_references_mapping %>%
  filter(match_confidence == "low")
if (nrow(low_conf) > 0) {
  cat("Low confidence matches requiring review:\n")
  print(low_conf %>% select(citation_text_clean, ref_full_text))
}
Export and Reporting
Export Analysis Results
# Create output directory
dir.create("analysis_output", showWarnings = FALSE)
# Export main results
write.csv(analysis$citations,
          "analysis_output/citations.csv",
          row.names = FALSE)
write.csv(analysis$citation_references_mapping,
          "analysis_output/matched_references.csv",
          row.names = FALSE)
write.csv(analysis$word_frequencies,
          "analysis_output/word_frequencies.csv",
          row.names = FALSE)
write.csv(analysis$citation_contexts,
          "analysis_output/citation_contexts.csv",
          row.names = FALSE)
# Export n-grams
for (n in names(analysis$ngrams)) {
  write.csv(analysis$ngrams[[n]],
            paste0("analysis_output/", n, ".csv"),
            row.names = FALSE)
}
Generate Report
# Create summary report
report <- list(
  document_info = list(
    doi = "10.xxxx/xxxxx",
    total_words = analysis$summary$total_words,
    sections = names(doc)[names(doc) != "Full_text"]
  ),
  citation_stats = analysis$summary[grep("citation", names(analysis$summary))],
  top_words = head(analysis$word_frequencies, 10),
  top_bigrams = head(analysis$ngrams$`2gram`, 10),
  section_citation_counts = analysis$citation_metrics$section_distribution
)
# Save as JSON
library(jsonlite)
write_json(report, "analysis_output/summary_report.json",
           pretty = TRUE, auto_unbox = TRUE)
Batch Processing
Process multiple documents:
# Define papers to process
papers <- data.frame(
  file = c("paper1.pdf", "paper2.pdf", "paper3.pdf"),
  doi = c("10.1000/1", "10.1000/2", "10.1000/3"),
  stringsAsFactors = FALSE
)
# Process all
results <- list()
for (i in seq_len(nrow(papers))) {
  cat("Processing:", papers$file[i], "\n")
  doc <- pdf2txt_auto(papers$file[i], n_columns = 2)
  results[[i]] <- analyze_scientific_content(
    text = doc,
    doi = papers$doi[i],
    mailto = "your@email.com"
  )
  Sys.sleep(1)  # Be polite to CrossRef API
}
names(results) <- papers$file
# Compare results
comparison <- data.frame(
  file = papers$file,
  words = sapply(results, function(r) r$summary$total_words),
  citations = sapply(results, function(r) r$summary$citations_extracted),
  matched = sapply(results, function(r) r$summary$references_matched),
  diversity = sapply(results, function(r) r$summary$lexical_diversity)
)
print(comparison)
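The loop above stops at the first document that fails to parse or fetch; a hedged variant wraps each iteration in tryCatch() so one bad file does not abort the whole batch (a sketch under the same assumptions as above):
# Fault-tolerant variant of the batch loop (sketch)
for (i in seq_len(nrow(papers))) {
  results[[papers$file[i]]] <- tryCatch({
    doc <- pdf2txt_auto(papers$file[i], n_columns = 2)
    analyze_scientific_content(text = doc, doi = papers$doi[i],
                               mailto = "your@email.com")
  }, error = function(e) {
    warning("Failed on ", papers$file[i], ": ", conditionMessage(e))
    NA  # sentinel marking a failed document
  })
  Sys.sleep(1)  # be polite to the CrossRef API
}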
Tips and Best Practices
DOI and Email
Provide a DOI and contact email whenever possible:
- Enables automatic CrossRef reference matching
- Dramatically improves citation-reference linking
- Provides metadata about the document
- Email should be valid (CrossRef policy)
Window Size
Choose window size based on your needs:
- Small (5-7): Focused analysis, immediate context
- Medium (8-12): Balanced approach (recommended)
- Large (15-20): Broader context, sentence-level analysis
CrossRef API
Be respectful of the CrossRef API:
- Use a valid email in mailto
- Add delays between requests in batch processing
- Consider rate limits for large-scale analysis
- Cache results when possible (see the sketch below)
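For repeated runs on the same document, a simple cache avoids redundant CrossRef calls; a minimal sketch using base R serialization (the file path is illustrative):
# Reuse a saved analysis if one exists, otherwise compute and cache it
cache_file <- "analysis_output/paper_analysis.rds"
if (file.exists(cache_file)) {
  analysis <- readRDS(cache_file)
} else {
  analysis <- analyze_scientific_content(text = doc,
                                         doi = "10.xxxx/xxxxx",
                                         mailto = "your@email.com")
  saveRDS(analysis, cache_file)
}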
See Also
- Citation Analysis: Deep dive into citation features
- Network Visualization: Create citation networks
- Text Analysis: Advanced text analytics
- Tutorial: Complete workflow examples