TALL is organised into three coherent layers — import, pre-processing, and analysis — each composed of modules that run independently or chain into a continuous workflow. Every intermediate result is visible, exportable, and reproducible by design.
Figure: The tab‑oriented navigation of TALL guides users through the analytical pipeline in order.
TALL accepts plain text, CSV, XLSX, and PDF files, with automatic character‑encoding detection to prevent loss of diacritics or non‑ASCII characters. Metadata can be uploaded in parallel — author, date, label — and later used as filters or grouping factors in downstream analyses.
A reactive Data Editor panel provides row filtering, full‑text search, and random or stratified sampling. Large corpora can be segmented into paragraphs, records, or user‑defined units, and edited corpora can be saved for subsequent sessions.
Native integration with the bibliometrix package allows direct ingestion of biblioshiny files for science‑mapping workflows.
TALL ships with 87 pre‑trained models — one per language variant — all derived from Universal Dependencies v2.15 treebanks and accessible through the companion tall.language.models repository.
Annotation is powered by the udpipe framework with language‑specific pre‑trained models. Users select tokens or lemmas as the unit of analysis; the choice propagates through all downstream modules.
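To see what the underlying annotation layer produces, here is a minimal udpipe sketch run outside the GUI. The english-ewt model is only an example; TALL downloads and manages models for you.

```r
library(udpipe)

# Download and load a Universal Dependencies model (english-ewt as an example)
m  <- udpipe_download_model(language = "english-ewt")
ud <- udpipe_load_model(m$file_model)

# Annotate a toy sentence: tokenisation, lemmatisation, PoS tagging, parsing
ann <- as.data.frame(udpipe_annotate(ud, x = "The delayed flights were cancelled."))
ann[, c("token", "lemma", "upos")]  # choose tokens or lemmas as the analysis unit
```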
Custom dictionaries, thesauri, and stopword lists can be uploaded for domain‑specific adaptation of lemmatisation and lexical normalisation.
TALL detects and tags platform‑specific features — emojis, hashtags, user mentions — distinguishing them from standard parts of speech. Their frequency distributions can be visualised and they can flow into downstream analyses as first‑class lexical units.
Benchmark from the paper, on the US Airline Tweets corpus: 127 unique emojis, 1,990 unique hashtags, and 930 user mentions detected automatically.
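As an illustration of the kind of pattern matching involved, a simplified base‑R sketch. These regular expressions are rough assumptions for the example, not TALL's actual rules.

```r
tweet <- "Thanks @AmericanAirlines! #delayed again \U0001F644"

# Simplified patterns; real platform grammars are more permissive
hashtags <- regmatches(tweet, gregexpr("#\\w+", tweet, perl = TRUE))[[1]]
mentions <- regmatches(tweet, gregexpr("@\\w+", tweet, perl = TRUE))[[1]]
emojis   <- regmatches(tweet, gregexpr("[\U0001F300-\U0001FAFF]", tweet, perl = TRUE))[[1]]

hashtags  # "#delayed"
mentions  # "@AmericanAirlines"
emojis    # the face-with-rolling-eyes emoji
```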
Four algorithms — including RAKE, Pointwise Mutual Information, and Morrone's IS‑score — identify salient collocations and candidate keyphrases. C++ via Rcpp keeps extraction fast even on large corpora.
Candidates can be curated manually, saved as reusable lists, and tagged in the main dataframe so that multi‑words are treated as single entries in every subsequent analysis.
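To make one of these measures concrete, pointwise mutual information for a candidate bigram can be computed directly from corpus counts. A toy sketch with hypothetical figures; TALL's own implementation runs in C++ via Rcpp.

```r
# N tokens in the corpus, f_xy bigram count, f_x and f_y unigram counts
pmi <- function(f_xy, f_x, f_y, N) {
  log2((f_xy / N) / ((f_x / N) * (f_y / N)))
}

# A bigram seen 120 times in 100,000 tokens (hypothetical figures)
pmi(f_xy = 120, f_x = 400, f_y = 300, N = 1e5)  # about 6.6: a strong collocation
```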
Universal part‑of‑speech tags (NOUN, VERB, ADJ, …) are shown alongside custom categories created in previous steps.
Before moving to word‑ and document‑level analyses, users specify which lexical categories to retain. Standard categories are automatically joined by any custom ones generated in earlier steps, so every relevant linguistic unit — conventional or domain‑specific — can be included on equal footing.
Word‑level and document‑level methods are grouped into coherent families. Each module is independent — you can run just one, or chain them. TALL AI adds a natural‑language interpretation alongside every output.
Frequency distributions, lexical diversity measures, TF‑IDF for discriminating terms across document sets, and interactive word clouds via plotly with zoom and tooltip interactions.
Figure: @AmericanAirlines mentions across the US Airline Tweets corpus.
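The TF‑IDF weighting behind the discriminating‑terms view can be sketched in a few lines of base R on a toy document‑term matrix; TALL builds this matrix internally.

```r
# Toy document-term matrix: rows = documents, columns = terms
dtm <- matrix(c(4, 0, 1,
                0, 3, 1), nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("delay", "refund", "flight")))

tf  <- dtm / rowSums(dtm)                 # term frequency within each document
idf <- log(nrow(dtm) / colSums(dtm > 0))  # inverse document frequency
sweep(tf, 2, idf, `*`)                    # "flight" scores 0: it discriminates nothing
```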
Concordance analysis reveals co‑text environments and usage patterns that frequency counts alone cannot surface, combining classical close reading with AI‑assisted pattern recognition.
Based on Wulff & Baker (2020).
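For intuition, a minimal keyword‑in‑context sketch in base R; the interactive module offers far richer filtering and display.

```r
# Minimal keyword-in-context (KWIC): each hit with n words of co-text on either side
kwic <- function(tokens, node, n = 3) {
  hits <- which(tokens == node)
  vapply(hits, function(i) {
    left  <- if (i > 1) paste(tokens[max(1, i - n):(i - 1)], collapse = " ") else ""
    right <- if (i < length(tokens)) paste(tokens[(i + 1):min(length(tokens), i + n)], collapse = " ") else ""
    trimws(sprintf("%s [%s] %s", left, node, right))
  }, character(1))
}

toks <- c("the", "flight", "was", "delayed", "and", "the", "crew", "apologised")
kwic(toks, "delayed")  # "the flight was [delayed] and the crew"
```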
Co‑word analysis via igraph and visNetwork with community detection and centrality metrics. Word embeddings through word2vec let you explore semantic relationships interactively.
Version 1.0 adds dependency‑based word networks with filters over syntactic relations — subject, object, modifier, complement — revealing grammatical structures behind the statistical patterns.
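The embeddings can also be explored directly in code; a toy sketch, assuming the word2vec R package (real corpora need far more text than this).

```r
library(word2vec)

# Train a small skip-gram model on a toy corpus (illustrative only)
txt   <- c("the flight was delayed", "the delayed flight landed",
           "crew apologised for the delay")
model <- word2vec(x = txt, type = "skip-gram", dim = 15, iter = 20, min_count = 1)

# Nearest neighbours in the embedding space
predict(model, newdata = "flight", type = "nearest", top_n = 3)
```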
Three sentiment lexicons side‑by‑side: Hu & Liu, Loughran‑McDonald (finance), and the NRC Emotion Lexicon for eight emotions — anger, anticipation, disgust, fear, joy, sadness, surprise, trust.
Outputs include donut‑chart overall distributions, per‑document breakdowns, and top‑word panels grouped by sentiment category. Context‑sensitive rules handle negation and intensifiers.
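To illustrate the lexicon‑based approach with a negation rule, a toy scorer; the lexicon and the one‑token negation window are invented for this example, and TALL's rule set is more elaborate.

```r
# Toy polarity lexicon and negator list (illustrative only)
lexicon  <- c(good = 1, great = 1, bad = -1, awful = -1)
negators <- c("not", "never", "no")

score_sentence <- function(tokens) {
  s <- 0
  for (i in seq_along(tokens)) {
    val <- lexicon[tokens[i]]
    if (!is.na(val)) {
      # Flip polarity when the preceding token is a negator
      if (i > 1 && tokens[i - 1] %in% negators) val <- -val
      s <- s + val
    }
  }
  unname(s)
}

score_sentence(c("the", "crew", "was", "not", "good"))  # -1: "good" flipped by "not"
```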
Three topic modelling families — Latent Dirichlet Allocation (LDA), Correlated Topic Model (CTM), and Structural Topic Model (STM). Automatic K selection based on coherence and exclusivity diagnostics.
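A minimal LDA fit with the topicmodels package shows the first family in action. The toy corpus and the choice of topicmodels are assumptions for this sketch; in TALL, K is chosen from the diagnostics above.

```r
library(tm)
library(topicmodels)

# Tiny toy corpus; real analyses use the pre-processed TALL corpus
docs <- c("flight delayed crew apologised", "refund request customer service",
          "boarding gate changed flight",   "service refund delayed response")
dtm  <- DocumentTermMatrix(Corpus(VectorSource(docs)))

lda <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1234))
terms(lda, 5)   # top terms per topic
topics(lda)     # most likely topic per document
```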
Intensive computations run asynchronously through future and promises, keeping the interface responsive on long runs.
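The pattern looks roughly like this generic future/promises sketch, not TALL's internal code.

```r
library(future)
library(promises)
plan(multisession)  # off-load heavy work to a background R session

# Start a long-running computation without blocking the Shiny event loop
p <- future_promise({
  Sys.sleep(5)        # stand-in for a long model fit
  "model finished"
})
p %...>% print()      # handle the result when the promise resolves
```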
Extractive summarisation through textrank — similarity graph with centrality ranking for key sentence extraction — plus abstractive summarisation through TALL AI for condensed narrative overviews.
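A minimal textrank sketch on three toy sentences; in TALL, the sentence and terminology tables come from the annotated corpus.

```r
library(textrank)

# Sentences with ids, plus the lemmas occurring in each sentence
sentences <- data.frame(textrank_id = 1:3,
                        sentence = c("The flight was delayed for hours.",
                                     "Passengers waited as the delay grew.",
                                     "The crew apologised for the flight delay."))
terms <- data.frame(textrank_id = c(1, 1, 2, 2, 3, 3, 3),
                    lemma = c("flight", "delay", "passenger", "delay",
                              "crew", "flight", "delay"))

tr <- textrank_sentences(data = sentences, terminology = terms)
summary(tr, n = 1)  # the most central sentence as a one-line summary
```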
All tables and statistics export as CSV or XLSX; visualisations as PNG at configurable DPI for publication quality. The Report module assembles selected outcomes into a single structured XLSX, preserving parameters and preprocessing choices for full provenance.
Work in progress can also be saved as .tall files and reloaded in a later session.
Core operations scale linearly with corpus size. TALL handles tens of thousands of documents on standard research laptops — and supports server deployment for million‑token studies without changing the codebase.
Benchmarks on Apple M4 Pro · 48 GB unified memory · R 4.5.2. Full protocol in the paper supplementary material.
Every TALL release is versioned on CRAN and tagged on GitHub. Metadata is described through DESCRIPTION, CITATION.cff, and codemeta.json. All dependencies are MIT‑compatible and declared explicitly.
Globally unique identifiers through CRAN & GitHub; tagged releases; rich CRAN metadata; indexed on RDocumentation.
Open HTTPS retrieval — install.packages("tall"). Historical metadata preserved through CRAN archive & GitHub commit history.
R & CRAN packaging conventions, Authors@R, SPDX license identifiers, explicit dependency declaration.
Comprehensive manual, vignettes, example workflows. MIT license. Full provenance through versioned releases.
TALL installs in two lines of R. No compilation tricks, no surprises.
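In full, assuming tall() as the exported launch function:

```r
install.packages("tall")  # open HTTPS retrieval from CRAN
library(tall)
tall()                    # launch the Shiny interface in the browser
```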