Documentation

Everything, written down.

The SoftwareX paper, supplementary material, tutorial slide deck, and release notes — everything you need to use TALL rigorously and cite it correctly.

Peer‑reviewed · 2026 · Open access · Elsevier SoftwareX

The SoftwareX paper

Aria, Spano, D'Aniello, Cuccurullo, Misuraca — TALL: Text Analysis for All — an interactive R‑Shiny application for exploring, modeling, and visualizing textual data.

Introduces the architecture, details every analytical module, and benchmarks performance across corpora spanning four orders of magnitude. Open‑access, reproducible, and the primary reference for citing TALL.

Digital Object Identifier
10.1016/j.softx.2026.102590
Supplementary material

Six sections of technical detail.

The supplementary appendix complements the paper with complete dependency lists, the FAIR‑principles audit, architecture diagrams, benchmark protocols, adoption analytics, and a step‑by‑step worked example.


A · TALL application dependencies · 42 R packages

TALL is built on a curated stack of 42 open‑source R packages covering UI, data I/O, NLP, analytics, network science, async execution, and export. All dependencies carry open‑source licenses and are declared explicitly in the DESCRIPTION file.

Shiny UI & widgets
shinycssloaders · shinydashboardPlus · shinyFiles · shinyjs · shinyWidgets · fontawesome · DT · sparkline
NLP & text mining
udpipe · topicmodels · word2vec · textrank · readtext · stringr
Networks & graphics
igraph · tidygraph · ggraph · visNetwork · plotly · umap
Statistics
ca · RSpectra · strucchange
Data I/O & formats
readr · readxl · openxlsx · pdftools · jsonlite · pagedown · base64enc
Async & performance
Rcpp · promises · later · parallel · doParallel
Data wrangling & HTTP
dplyr · tidyr · purrr · rlang · curl · httr2 · chromote

Full alphabetical table with repository URLs and descriptions in the supplementary PDF, Table S1.
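
For readers who want to cross‑check Table S1 locally, the short snippet below reads the Imports field of an installed copy of the package; it assumes tall has already been installed from CRAN.

```r
# List the declared Imports straight from TALL's DESCRIPTION file.
imports <- utils::packageDescription("tall", fields = "Imports")
sort(trimws(strsplit(imports, ",")[[1]]))
```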

B · FAIR principles for research software audit · 15 criteria

TALL aligns with the FAIR Guiding Principles (Wilkinson et al., 2016) as applied to research software. Table S2 of the supplementary material maps each of the 15 criteria to a concrete TALL implementation.

F · Findable
F1 — Persistent identifiers on CRAN and tagged GitHub releases.
F2 — Rich metadata via DESCRIPTION, CITATION.cff, codemeta.json.
F3 — Metadata links directly to CRAN and GitHub identifiers.
F4 — Indexed on CRAN, GitHub, and RDocumentation.
A · Accessible
A1 — Open HTTPS retrieval via install.packages("tall").
A1.1 — Protocol is open, free, and universal.
A1.2 — Authentication supported where needed (developer access).
A2 — Historical metadata preserved via CRAN archive and GitHub history.
I · Interoperable
I1 — R & CRAN packaging conventions (DESCRIPTION, NAMESPACE).
I2 — Authors@R, SPDX license identifiers, CRAN controlled vocabulary.
I3 — Dependencies explicitly declared via Imports; environment requirements in README.
R · Reusable
R1 — Comprehensive manual, vignettes, and example workflows.
R1.1 — MIT license (SPDX identifier declared).
R1.2 — Full provenance through Git history, release notes, ORCIDs.
R1.3 — Compliant with CRAN and tidyverse community standards.
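
As a quick illustration of criteria A1 and A1.1 above, installing and launching TALL takes a few lines of R; the launch call tall() is the package's assumed entry point.

```r
# Install from CRAN over HTTPS (criterion A1) and launch the Shiny app.
install.packages("tall")
library(tall)
tall()   # assumed entry point; opens the TALL interface in the default browser
```
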
C · Software architecture · 5‑layer pipeline

TALL adopts a modular, stratified design organised into five sequential layers — each autonomous yet contributing to a coherent end‑to‑end analytical workflow.

  1. Data Import & Management

    Ingests TXT / CSV / XLSX / PDF through readtext, pdftools, readxl. Performs automatic encoding detection, metadata integration, and structured integrity checks to prevent downstream failures.

  2. Text Pre‑processing

    Tokenisation, PoS tagging, lemmatisation, and special‑entity recognition via the udpipe framework, using 87 pre‑trained models spanning 56 languages derived from Universal Dependencies Treebanks (a minimal code sketch follows this list).

  3. Analysis

    Three complementary module families: Overview (descriptive stats, TF‑IDF, word clouds), Words (concordances, Reinert clustering, correspondence analysis, networks via ca/igraph/word2vec), Documents (topic modelling, polarity, summarisation via topicmodels/textrank).

  4. TALL AI Assistant

    Interpretive layer built on the Google Gemini API. Generates context‑aware natural‑language explanations alongside statistical outputs, enabling hybrid quantitative + qualitative analysis.

  5. Output & Visualisation

    Interactive dashboards via plotly and visNetwork, reproducible XLSX reports via openxlsx. Captures imported data, preprocessing steps, parameters, and outputs in a structured archival format.
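
A minimal sketch of layers 1 and 2, assuming a folder of plain‑text files and the English GUM udpipe model used later in the worked example. TALL drives these same packages from its GUI, so the snippet is illustrative rather than TALL's internal API.

```r
library(readtext)   # layer 1: data import
library(udpipe)     # layer 2: pre-processing

# Import a small corpus; readtext returns a data frame with doc_id and text.
docs <- readtext("corpus/*.txt")

# Download and load a pre-trained Universal Dependencies model (English GUM).
ud_model <- udpipe_download_model(language = "english-gum")
ud_en    <- udpipe_load_model(ud_model$file_model)

# Tokenise, lemmatise, and PoS-tag every document.
ann <- as.data.frame(udpipe_annotate(ud_en, x = docs$text, doc_id = docs$doc_id))
head(ann[, c("doc_id", "token", "lemma", "upos")])
```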

TALL five-layer architecture diagram
Figure S1 TALL's layered architecture. Each layer processes the output of the previous one while enriching metadata for downstream stages.
D · Computational performance benchmarks · 78 → 5.3M tokens

Comprehensive benchmarking of seven core operations across 14 corpus configurations spanning four orders of magnitude. All tests conducted on a standardised research workstation with corpora sampled with replacement from Herman Melville's Moby Dick.

Hardware

Apple M4 Pro · 14 cores (10P + 4E) · ARM
48 GB unified memory · macOS 15.3
R 4.5.2 · TALL 0.5.1

Protocol

10 iterations per config · median time
mean peak memory · bench package
14 corpus configurations · 7 operations
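
The measurement protocol above can be approximated with the bench package; in this sketch the udpipe call stands in for TALL's internal preprocessing routine, and corpus_text is an assumed character vector holding one of the sampled corpus configurations.

```r
library(bench)
library(udpipe)

# corpus_text: character vector sampled from the corpus (assumed to exist)
ud_en <- udpipe_load_model(udpipe_download_model(language = "english-gum")$file_model)

timings <- bench::mark(
  preprocessing = as.data.frame(udpipe_annotate(ud_en, x = corpus_text)),
  iterations    = 10,    # 10 runs per configuration, median time reported
  memory        = TRUE,  # track R-level memory allocations
  check         = FALSE
)
timings[, c("expression", "median", "mem_alloc")]
```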

Operation                                     Scaling         Time (5.3M tokens)    Peak memory
Preprocessing (tokenize · lemmatize · PoS)    O(n)            398.78 s              678 MB
Multiword (RAKE, Rcpp)                        sub‑linear      7.77 s                2.7 GB
Keyness (reference χ²)                        vocab‑bound     5.59 s                4.4 GB
Topic modelling (LDA, k=5)                    parallelised    23.98 s               1.7 GB
Word embeddings (Word2Vec)                    parallelised    11.61 s               1.2 GB
Network (Louvain)                             linear          5.14 s                2.4 GB
Polarity (Hu‑Liu)                             linear          16.48 s               5.6 GB

Key findings. Preprocessing sustains roughly 14,500 tokens per second with stable linear scaling. Rcpp‑optimised modules (multi‑word, Reinert) remain fast at all scales. Keyness analysis is the memory bottleneck on large corpora, driven by reference‑vocabulary size rather than token count.

Small corpora (<10⁵ tokens) fit comfortably in 1–2 GB RAM. Medium corpora require ~5 GB. Large corpora (>10⁶ tokens) benefit from institutional server deployment.

E · Adoption & CRAN downloads · 4,000+ installs

Since its initial CRAN release in February 2025, TALL has accumulated more than 4,000 downloads as of November 2025 (source: cranlogs). The download curve shows marked acceleration between April and June 2025 during the early‑adoption phase, followed by sustained near‑linear growth through the rest of the year.

CRAN download trajectory, February to November 2025
Figure S2 Cumulative daily downloads of the tall package from CRAN, February to November 2025.

The trajectory confirms both early visibility and continued integration into analytical workflows within the R research ecosystem. TALL is currently used in postgraduate and doctoral programs at several Italian universities.
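
The counts behind Figure S2 can be reproduced with the cranlogs package; the date range below mirrors the February to November 2025 window cited above.

```r
library(cranlogs)

# Daily download counts for the tall package from the CRAN mirror logs.
dl <- cran_downloads(packages = "tall", from = "2025-02-01", to = "2025-11-30")
dl <- dl[order(dl$date), ]
dl$cumulative <- cumsum(dl$count)
tail(dl, 1)   # cumulative total at the end of the window
```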

F · Step‑by‑step worked example · US Airline Tweets · 7 steps

A practical demonstration on the built‑in US Airline Tweets corpus — 14,640 tweets directed at six major US airlines in February 2015. Illustrates social‑media analytics, sentiment detection, and customer‑feedback analysis without requiring programming expertise.

  1. Import & AI context

    Built‑in dataset loaded; TALL AI automatically receives a pre‑configured context so every subsequent interpretation is dataset‑aware.

  2. Tokenisation & PoS tagging

    English selected as source language; pre‑trained EN GUM model employed. Unit of analysis set to lemma.

  3. Special‑entity recognition

    Automatic tagging of 127 unique emojis, 1,990 unique hashtags, and 930 user mentions as first‑class lexical units.

  4. Multi‑word detection

    RAKE algorithm surfaces collocations like "customer service", "late flight", "flight attendant". Frequency‑ranked and manually curated.

  5. PoS selection

    Retained categories: NOUN, PROPN, ADJ, HASH, MULTIWORD — a unified mechanism lets standard and custom categories coexist.

  6. Overview + Words in Context

    Descriptive statistics; TALL AI recommends Word‑in‑Context and Polarity Detection as next steps. Concordances for @AmericanAirlines reveal customer‑service discourse.

  7. Polarity detection

    Hu‑Liu lexicon · 48.6% neutral, 28.0% negative (dominated by delay, cancel, miss), 23.4% positive (thank, good, great, appreciate).
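
The polarity step can be approximated outside TALL with the Hu‑Liu lexicon, which tidytext exposes as the "bing" dictionary. In the sketch below, tweet_text is an assumed character vector holding the 14,640 tweets, and TALL's own GUI‑based implementation may differ in detail.

```r
library(dplyr)
library(tidyr)
library(tidytext)

tweets <- tibble::tibble(doc_id = seq_along(tweet_text), text = tweet_text)

polarity <- tweets |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("bing"), by = "word") |>   # Hu-Liu opinion lexicon
  count(doc_id, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(label = case_when(
    positive > negative ~ "positive",
    negative > positive ~ "negative",
    TRUE                ~ "neutral"
  ))

# Note: tweets with no lexicon hit drop out of the join above, so the neutral
# share here will differ from the 48.6% reported in step 7.
prop.table(table(polarity$label))
```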

TALL AI · Key takeaway

"The dominance of delay and cancel in negative tweets suggests airlines should focus on improving their processes for handling flight disruptions."

See the full walkthrough →

§ 1 · Learn

200+ slides of hands‑on tutorial.

The official TALL tutorial walks through every module in depth — the logic behind each analytical choice, preprocessing steps, statistical and semantic methods, and interpretation strategies with worked examples.

  • Module‑by‑module walkthrough
  • Pre‑processing & NLP technique deep dives
  • LDA, CA, clustering, polarity detection
  • Example outputs & interpretation strategies
Open the tutorial
TALL tutorial slide deck preview
Figure Preview of the 200‑slide tutorial deck hosted at k‑synth.com/tall.
§ 2 · Release notes

v1.0.0 changelog.

The milestone release accompanying the SoftwareX publication.

New

SVO triplets extraction

Subject‑Verb‑Object structures from dependency trees.
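
As a rough illustration of what this module computes (not TALL's own implementation), subject-verb-object triplets can be read off a udpipe dependency parse by joining nsubj and obj tokens to their governing verb.

```r
library(udpipe)
library(dplyr)

ud_en <- udpipe_load_model(udpipe_download_model(language = "english-gum")$file_model)
ann   <- as.data.frame(udpipe_annotate(ud_en, x = "The airline cancelled my flight."))

# Subjects and objects point to their verb through head_token_id.
subjects <- ann |> filter(dep_rel == "nsubj") |>
  select(doc_id, sentence_id, subject = lemma, verb_id = head_token_id)
objects  <- ann |> filter(dep_rel == "obj") |>
  select(doc_id, sentence_id, object = lemma, verb_id = head_token_id)
verbs    <- ann |> filter(upos == "VERB") |>
  select(doc_id, sentence_id, verb_id = token_id, verb = lemma)

svo <- subjects |>
  inner_join(verbs,   by = c("doc_id", "sentence_id", "verb_id")) |>
  inner_join(objects, by = c("doc_id", "sentence_id", "verb_id")) |>
  select(subject, verb, object)

svo   # expected: airline / cancel / flight
```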

New

Syntactic complexity

Metrics for measuring sentence construction across documents.

New

NRC emotion analysis

Eight‑dimensional emotion detection: anger, anticipation, disgust, fear, joy, sadness, surprise, trust.

New

Noun phrase extraction

Identify and rank noun phrases to complement single‑word analysis.

Updated

Topic modeling

CTM, STM, automatic K selection, coherence & exclusivity diagnostics.

Updated

Network analysis

Dependency‑based word networks with syntactic relation filters.

Updated

Image export

DPI‑aware rendering for publication‑quality figures.

Updated

TALL AI integration

Multiple Gemini model variants; Medium (16k) and Large (32k) context windows.

Updated

Reproducibility pipeline

Cumulative XLSX report export and session state persistence.

§ 3 · Resources

All the links, in one place.

Questions? Get in touch.

For bug reports, feature requests, and contributions, use the GitHub issue tracker. For general inquiries, email the corresponding author.