The SoftwareX paper, supplementary material, tutorial slide deck, and release notes — everything you need to use TALL rigorously and cite it correctly.
Aria, Spano, D'Aniello, Cuccurullo, and Misuraca — TALL: Text Analysis for All — an interactive R‑Shiny application for exploring, modeling, and visualizing textual data.
Introduces the architecture, details every analytical module, and benchmarks performance across corpora spanning four orders of magnitude. Open‑access, reproducible, and the primary reference for citing TALL.
The supplementary appendix complements the paper with complete dependency lists, the FAIR‑principles audit, architecture diagrams, benchmark protocols, adoption analytics, and a step‑by‑step worked example.
Click any section below to expand its contents.
TALL is built on a curated stack of 42 open‑source R packages covering UI, data I/O, NLP, analytics, network science, async execution, and export. All dependencies carry MIT or similarly permissive licenses and are declared explicitly in the DESCRIPTION file.
Full alphabetical table with repository URLs and descriptions in the supplementary PDF, Table S1.
TALL aligns with the FAIR Guiding Principles for research software (Wilkinson et al., 2016). Table S2 of the supplementary material maps each of the 13 criteria to a concrete TALL implementation.
Highlights of the audit: machine‑readable metadata in DESCRIPTION, CITATION.cff, and codemeta.json; one‑command installation via install.packages("tall"); standard package structure (DESCRIPTION, NAMESPACE); structured authorship via Authors@R; SPDX license identifiers and the CRAN controlled vocabulary; all dependencies declared in Imports, with environment requirements documented in the README.
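In practice, the packaging choices above mean TALL installs and cites like any standard CRAN package. A minimal session sketch (the launch call `tall()` follows the package README; this assumes a current R installation with CRAN access):

```r
# Install the released version from CRAN; declared dependencies
# are resolved automatically from the DESCRIPTION Imports field.
install.packages("tall")

# Launch the interactive Shiny application.
library(tall)
tall()

# Retrieve the citation metadata shipped with the package.
citation("tall")
```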
TALL adopts a modular, stratified design organised into five sequential layers — each autonomous yet contributing to a coherent end‑to‑end analytical workflow.
Ingests TXT / CSV / XLSX / PDF through readtext, pdftools, readxl. Performs automatic encoding detection, metadata integration, and structured integrity checks to prevent downstream failures.
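The ingestion layer can be sketched with `readtext`, which exposes one interface for all supported formats. The file and its contents below are illustrative:

```r
library(readtext)

# Create a small plain-text file to stand in for a user corpus.
tmp <- tempfile(fileext = ".txt")
writeLines("Call me Ishmael.", tmp)

# readtext returns a tidy data frame with doc_id and text columns;
# the same call handles CSV, XLSX, and PDF inputs.
docs <- readtext(tmp)
docs$text
```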
Tokenisation, PoS tagging, lemmatisation, and special‑entity recognition via the udpipe framework, using 87 pre‑trained models spanning 56 languages derived from Universal Dependencies Treebanks.
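The linguistic layer can be sketched directly with the `udpipe` API; the sentence is illustrative, and the model download requires network access:

```r
library(udpipe)

# Download and load a pre-trained Universal Dependencies model
# (here English GUM, one of the 87 models TALL exposes).
m     <- udpipe_download_model(language = "english-gum")
model <- udpipe_load_model(m$file_model)

# Tokenise, PoS-tag, lemmatise, and parse in one call.
ann <- udpipe(x = "TALL lowers the barrier to text analysis.",
              object = model)
head(ann[, c("token", "lemma", "upos")])
```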
Three complementary module families: Overview (descriptive stats, TF‑IDF, word clouds), Words (concordances, Reinert clustering, correspondence analysis, networks via ca/igraph/word2vec), Documents (topic modelling, polarity, summarisation via topicmodels/textrank).
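The network side of the Words family rests on `igraph`; a toy sketch of community detection on a hand-made co‑occurrence graph (the lemmas and weights are invented for illustration):

```r
library(igraph)

# Toy co-occurrence network: edges weighted by how often
# two lemmas appear in the same sentence.
edges <- data.frame(
  from   = c("flight", "flight", "delay", "service", "service"),
  to     = c("delay",  "cancel", "cancel", "crew",   "thank"),
  weight = c(5, 3, 4, 2, 2)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Louvain community detection, as used for word-network clustering.
comm <- cluster_louvain(g)
membership(comm)
```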
Interpretive layer built on the Google Gemini API. Generates context‑aware natural‑language explanations alongside statistical outputs, enabling hybrid quantitative + qualitative analysis.
Interactive dashboards via plotly and visNetwork, reproducible XLSX reports via openxlsx. Captures imported data, preprocessing steps, parameters, and outputs in a structured archival format.
Comprehensive benchmarking of seven core operations across 14 corpus configurations spanning four orders of magnitude. All tests conducted on a standardised research workstation with corpora sampled with replacement from Herman Melville's Moby Dick.
Apple M4 Pro · 14 cores (10P + 4E) · ARM
48 GB unified memory · macOS 15.3
R 4.5.2 · TALL 0.5.1
10 iterations per config · median time
mean peak memory · bench package
14 corpus configurations × 7 operations
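The measurement protocol can be reproduced with the `bench` package. A minimal sketch on a toy workload (the corpus and tokenisation call here are illustrative stand‑ins, not the paper's actual harness):

```r
library(bench)

# Toy corpus standing in for a sampled benchmark configuration.
corpus <- rep("Call me Ishmael.", 1000)

# Median time and allocated memory over 10 iterations,
# mirroring the protocol used in the paper.
res <- bench::mark(
  tokenize = strsplit(corpus, "\\s+"),
  iterations = 10
)
res[, c("expression", "median", "mem_alloc")]
```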
| Operation | Scaling | Time (5.3M tokens) | Peak memory |
|---|---|---|---|
| Preprocessing (tokenize · lemmatize · PoS) | O(n) | 398.78 s | 678 MB |
| Multiword (RAKE, Rcpp) | sub‑linear | 7.77 s | 2.7 GB |
| Keyness (reference χ²) | vocab‑bound | 5.59 s | 4.4 GB |
| Topic modelling (LDA, k=5) | parallelised | 23.98 s | 1.7 GB |
| Word embeddings (Word2Vec) | parallelised | 11.61 s | 1.2 GB |
| Network (Louvain) | linear | 5.14 s | 2.4 GB |
| Polarity (Hu‑Liu) | linear | 16.48 s | 5.6 GB |
Key findings. Preprocessing sustains ~14,500 tokens/sec with stable linear scaling. Rcpp‑optimised modules (multi‑word, Reinert) remain fast at all scales. Keyness analysis is the memory bottleneck on large corpora, driven by reference‑vocabulary size rather than token count.
Small corpora (<10⁵ tokens) fit comfortably in 1–2 GB RAM. Medium corpora require ~5 GB. Large corpora (>10⁶ tokens) benefit from institutional server deployment.
Since its initial CRAN release in February 2025, TALL had exceeded 4,000 cumulative downloads by November 2025 (source: cranlogs). The download curve shows marked acceleration between April and June 2025 during the early‑adoption phase, followed by sustained near‑linear growth through the rest of the year.
Downloads of the tall package from CRAN, February–November 2025. The trajectory confirms both early visibility and continued integration into analytical workflows within the R research ecosystem. TALL is currently used in postgraduate and doctoral programs at several Italian universities.
A practical demonstration on the built‑in US Airline Tweets corpus — 14,640 tweets directed at six major US airlines in February 2015. Illustrates social‑media analytics, sentiment detection, and customer‑feedback analysis without requiring programming expertise.
Built‑in dataset loaded; TALL AI automatically receives a pre‑configured context so every subsequent interpretation is dataset‑aware.
English selected as source language; pre‑trained EN GUM model employed. Unit of analysis set to lemma.
Automatic tagging of 127 unique emojis, 1,990 unique hashtags, and 930 user mentions as first‑class lexical units.
RAKE algorithm surfaces collocations like "customer service", "late flight", "flight attendant". Frequency‑ranked and manually curated.
Retained categories: NOUN, PROPN, ADJ, HASH, MULTIWORD — unified mechanism lets standard and custom categories coexist.
Descriptive statistics; TALL AI recommends Word‑in‑Context and Polarity Detection as next steps. Concordances for @AmericanAirlines reveal customer‑service discourse.
Hu‑Liu lexicon · 48.6% neutral, 28.0% negative (dominated by delay, cancel, miss), 23.4% positive (thank, good, great, appreciate).
"The dominance of delay and cancel in negative tweets suggests airlines should focus on improving their processes for handling flight disruptions."
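TALL's polarity module uses the Hu‑Liu lexicon; the counting logic behind it can be sketched in base R. The tiny word lists below are illustrative stand‑ins drawn from the results above, not the actual lexicon:

```r
# Toy lexicon-based polarity in the spirit of the Hu-Liu approach.
pos_words <- c("thank", "good", "great", "appreciate")
neg_words <- c("delay", "cancel", "miss")

# Score a tokenised tweet: positive hits minus negative hits.
polarity <- function(tokens) {
  score <- sum(tokens %in% pos_words) - sum(tokens %in% neg_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

polarity(c("thank", "you", "great", "crew"))   # "positive"
polarity(c("flight", "delay", "cancel"))       # "negative"
polarity(c("on", "time"))                      # "neutral"
```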
The official TALL tutorial walks through every module in depth — the logic behind each analytical choice, preprocessing steps, statistical and semantic methods, and interpretation strategies with worked examples.
Available at k‑synth.com/tall.

The milestone release accompanying the SoftwareX publication.
Subject‑Verb‑Object structures from dependency trees.
Metrics for measuring sentence construction across documents.
Eight‑dimensional emotion detection: anger, anticipation, disgust, fear, joy, sadness, surprise, trust.
Identify and rank noun phrases to complement single‑word analysis.
CTM, STM, automatic K selection, coherence & exclusivity diagnostics.
Dependency‑based word networks with syntactic relation filters.
DPI‑aware rendering for publication‑quality figures.
Multiple Gemini model variants; Medium (16k) and Large (32k) context windows.
Cumulative XLSX report export and session state persistence.
For bug reports, feature requests and contributions, use the GitHub issue tracker. For general inquiries, email the corresponding author.