This article explains the design decisions behind projr and why it works the way it does.
Goal: Reduce the mental burden of maintaining reproducible research.
How projr achieves this:
projr_build() does everything
Anti-pattern: Complex build systems requiring extensive configuration, multiple commands, and deep understanding of internals.
Example: Compare a typical make-based workflow:
# Traditional approach
make clean
make data
make analysis
make figures
make paper
make deploy-to-osf
git add .
git commit -m "Update"
git push
gh release create v0.1.0
# ... upload files manually ...
With projr:
Goal: Make it safe to experiment without losing work.
How projr achieves this:
Dev builds write to _tmp/projr/v<version>/, not _output
Anti-pattern: In-place modification of output directories leading to lost results.
Example: Without projr, you might:
# Accidentally overwrite yesterday's figures
rmarkdown::render("analysis.Rmd") # Oh no, the new plot is worse!
# Now you've lost the good version
With projr:
Goal: Automate tedious tasks whilst maintaining transparency.
How projr achieves this:
_projr.yml makes everything visible
Anti-pattern: Build systems with hidden state, implicit dependencies, or configuration scattered across multiple locations.
Example: projr’s manifest system:
label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15
You can inspect, version-control, and audit this file. Compare this with a system that tracks dependencies in a binary database or an in-memory cache.
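Because the manifest is plain text, auditing it needs nothing beyond base R. The sketch below is illustrative only: the column names come from the example above, but the file contents, the choice of MD5, the per-file path, and the helper verify_manifest_row are assumptions, not projr's actual implementation.

```r
# Hypothetical audit of a manifest row against a file on disk.
# Assumptions: MD5 hashes and a per-file `path` column; projr's real
# manifest layout and hash algorithm may differ.

# Create a throwaway file so the example is self-contained.
data_file <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), data_file)

# Build a one-row manifest with the columns shown in the article.
manifest <- data.frame(
  label = "raw-data",
  path = data_file,
  hash = unname(tools::md5sum(data_file)),
  version = "v0.1.0",
  timestamp = "2024-01-15",
  stringsAsFactors = FALSE
)

# Round-trip through CSV, as an on-disk manifest would be.
manifest_file <- tempfile(fileext = ".csv")
write.csv(manifest, manifest_file, row.names = FALSE)
recorded <- read.csv(manifest_file, colClasses = "character")

# TRUE when the file still matches the hash recorded for it.
verify_manifest_row <- function(row) {
  identical(unname(tools::md5sum(row$path)), row$hash)
}

unchanged_before <- verify_manifest_row(recorded[1, ])

# Simulate an edit to the data, then audit again.
cat("3,30\n", file = data_file, append = TRUE)
unchanged_after <- verify_manifest_row(recorded[1, ])
```

Here unchanged_before is TRUE and unchanged_after is FALSE, which is the kind of check a text manifest makes easy to run by hand.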
Goal: Make it easier to be reproducible than not.
How projr achieves this:
Anti-pattern: Reproducibility as an afterthought requiring manual effort.
Example: Without thinking about it, projr users get:
v0.1.0:
- code: Git SHA abc123
- data: Hash def456
- outputs: Hash ghi789
- packages: renv.lock
- archived: GitHub Release v0.1.0
All automatically. To reproduce:
Principle: Each directory has exactly one purpose.
Rationale:
Trade-offs:
Why this trade-off is worth it:
The clarity and automation benefits outweigh the slight increase in directory count. Modern file systems handle many directories efficiently.
Principle: Version the entire project state, not individual files.
Rationale:
Trade-offs:
Why this trade-off is worth it:
Scientific outputs depend on multiple inputs. Versioning the whole project ensures you can always reconstruct a consistent state. Per-file versioning leads to a combinatorial explosion of possible states.
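To put rough numbers on the combinatorial argument, here is a small worked example. The figures (10 files, 5 versions each) are invented purely for illustration.

```r
# Illustrative arithmetic only; the counts are assumptions.
n_files <- 10
versions_per_file <- 5

# Per-file versioning: any mix of file versions is a possible state.
per_file_states <- versions_per_file^n_files

# Whole-project versioning: one consistent state per release.
project_states <- versions_per_file

format(per_file_states, big.mark = ",")  # "9,765,625"
project_states                           # 5
```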
Principle: Project structure and build behaviour in _projr.yml, not scattered across code.
Rationale:
Trade-offs:
Why this trade-off is worth it:
Most research projects don’t need the flexibility of code-based configuration. The benefits of having a single, readable, version-controlled configuration file outweigh the limitations.
Principle: Separate safe iteration from committed releases.
Rationale:
Trade-offs:
Why this trade-off is worth it:
The safety and speed benefits are critical for iterative research. The cost of remembering two commands is minimal.
Principle: projr works with or without Git, but works better with it.
Rationale:
Trade-offs:
Why this trade-off is worth it:
Git is powerful but intimidating. By making it optional, projr reaches more users whilst still offering Git benefits to those who want them.
projr is organised into layers:
User-facing API (projr_build, projr_init, ...)
↓
Configuration layer (YAML parsing, validation)
↓
Build engine (rendering, versioning, archiving)
↓
Backend services (Git, GitHub, OSF, file system)
Benefits:
projr uses systematic naming:
- projr_*: All exported functions
- .projr_*: Internal functions (not exported)
- projr_build*: Build-related functions
- projr_init*: Initialisation functions
- projr_yml_*: YAML configuration functions
- projr_path_*: Path helper functions
Benefits:
projr uses this precedence for configuration:
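1. Environment variables (highest priority)
2. Profile-specific configuration (_projr-{profile}.yml)
3. Project configuration (_projr.yml)
4. Built-in defaults (lowest priority)
Example: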
# Built-in default
output: _output
# Overridden in _projr.yml
output: _my_output
# Overridden in _projr-dev.yml (if PROJR_PROFILE=dev)
output: _dev_output
# Overridden by environment variable (if set)
PROJR_OUTPUT_DIR=_temp_output
Final value: _temp_output
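The override chain in this example can be sketched in a few lines of base R. This is a minimal illustration of layered configuration, not projr's own code: resolve_config is a hypothetical helper, and the lists stand in for values that would really be read from the YAML files and the environment.

```r
# Sketch of layered configuration: later layers override earlier ones.
# `resolve_config` is an illustrative helper, not a projr function.
resolve_config <- function(defaults, project = list(), profile = list(),
                           env = list()) {
  # modifyList() replaces matching names and keeps everything else.
  config <- utils::modifyList(defaults, project)
  config <- utils::modifyList(config, profile)
  utils::modifyList(config, env)
}

defaults <- list(output = "_output")       # built-in default
project  <- list(output = "_my_output")    # from _projr.yml
profile  <- list(output = "_dev_output")   # from _projr-dev.yml
env      <- list(output = "_temp_output")  # from PROJR_OUTPUT_DIR

final  <- resolve_config(defaults, project, profile, env)
no_env <- resolve_config(defaults, project, profile)

final$output   # "_temp_output"
no_env$output  # "_dev_output"
```

Dropping a layer falls back to the next one down, which is why the profile value wins when the environment variable is unset.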
Benefits:
Manifests use CSV for simplicity and compatibility:
label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z
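One practical consequence of a flat CSV manifest is that two versions can be compared with ordinary data-frame operations. The sketch below assumes the column layout shown above; the hash values are made-up placeholders and changed_labels is a hypothetical helper, not part of projr.

```r
# Compare two manifest snapshots to see which labelled inputs changed.
# Hashes are placeholders; `changed_labels` is hypothetical.
old_csv <- "label,path,hash,version,timestamp
raw-data,_raw_data,hash-a1,v0.1.0,2024-01-15T10:30:00Z
output,_output,hash-b1,v0.1.0,2024-01-15T10:30:00Z"

new_csv <- "label,path,hash,version,timestamp
raw-data,_raw_data,hash-a1,v0.2.0,2024-02-01T09:00:00Z
output,_output,hash-b2,v0.2.0,2024-02-01T09:00:00Z"

# Read every column as text so hashes are never coerced to numbers.
read_manifest <- function(text) {
  read.csv(text = text, colClasses = "character")
}

# Join the snapshots on label and keep rows whose hash differs.
changed_labels <- function(old, new) {
  both <- merge(old, new, by = "label", suffixes = c("_old", "_new"))
  both$label[both$hash_old != both$hash_new]
}

changed <- changed_labels(read_manifest(old_csv), read_manifest(new_csv))
changed  # "output"
```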
Why CSV?
Alternative considered: JSON
Decision: Use x.y.z versioning (major.minor.patch)
Rationale:
Alternative considered: Date-based versioning (2024-01-15)
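A short sketch of how x.y.z versions behave may help here. It is written in base R for illustration only; bump_version is a hypothetical helper and does not claim to mirror how projr bumps versions internally.

```r
# Bump an x.y.z version string. `bump_version` is a hypothetical helper
# written for illustration; it is not projr's own implementation.
bump_version <- function(version, component = c("patch", "minor", "major")) {
  component <- match.arg(component)
  # Drop an optional leading "v", then split into integer components.
  parts <- as.integer(strsplit(sub("^v", "", version), ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  names(parts) <- c("major", "minor", "patch")
  parts[component] <- parts[component] + 1L
  # Lower-order components reset to zero after a bump.
  if (component == "major") parts[c("minor", "patch")] <- 0L
  if (component == "minor") parts["patch"] <- 0L
  paste(parts, collapse = ".")
}

bump_version("0.1.0", "patch")   # "0.1.1"
bump_version("0.1.9", "minor")   # "0.2.0"
bump_version("v0.9.3", "major")  # "1.0.0"

# numeric_version() compares components numerically, not as text.
numeric_version("0.10.0") > numeric_version("0.9.0")  # TRUE
```

Unlike a date stamp, the bumped component signals how large a change is, and base R can already order such versions correctly.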
Decision: GitHub Releases is the default archive destination
Rationale:
Alternative considered: OSF as primary
Solution: Support both; default to GitHub for familiarity.
Decision: Default to clearing _output before final builds
Rationale:
Alternative considered: Incremental updates
Solution: Clear by default; allow override via PROJR_OUTPUT_CLEAR.
Decision: Dev builds write to _tmp/projr/v<version>/, not _output
Rationale:
Alternative considered: Use _output with a flag to prevent overwrites
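The routing rule in this decision can be sketched as a small function. This is an assumption-laden illustration: build_output_dir is a hypothetical helper, not a projr function, and the exact version string projr uses for dev builds may differ.

```r
# Choose where a build writes, following the decision above.
# `build_output_dir` is a hypothetical helper, not a projr function.
build_output_dir <- function(version, dev = TRUE) {
  if (dev) {
    # Dev builds go to a versioned scratch directory.
    file.path("_tmp", "projr", paste0("v", version))
  } else {
    # Final builds write to the real output directory.
    "_output"
  }
}

build_output_dir("0.1.0", dev = TRUE)   # "_tmp/projr/v0.1.0"
build_output_dir("0.1.0", dev = FALSE)  # "_output"
```

Keeping the two destinations separate means no flag has to be remembered to protect existing results.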
Decision: Use YAML for configuration
Rationale:
Alternatives considered:
TOML:
- ✅ Simpler syntax
- ❌ Less familiar in R ecosystem
- ❌ Harder to nest deeply
JSON:
- ✅ Strict, machine-friendly
- ❌ Less human-readable (quotes, no comments)
- ❌ Harder to hand-edit
These are design considerations for future versions:
1. Incremental builds
Idea: Only rebuild changed documents
Pros: Faster builds, less re-rendering
Cons: More complexity, risk of stale outputs
Decision: Consider for v2.0 with careful invalidation logic
2. Dependency graphs
Idea: Track which outputs depend on which inputs
Pros: Finer-grained rebuilding, better traceability
Cons: Complexity, requires analysing code
Decision: Interesting but out-of-scope for now
3. Remote execution
Idea: Build on CI/cloud instead of locally
Pros: Reproducible environment, faster hardware
Cons: Network dependency, setup complexity
Decision: Possible via existing CI integrations (GitHub Actions)
4. Multi-language support
Idea: Support Python, Julia, etc., not just R
Pros: Broader audience, more use cases
Cons: Different ecosystems, more maintenance
Decision: Focus on R first; generalise later if demand exists
targets: Pipeline tool for dependency tracking
Similarities:
- Both focus on reproducibility
- Both integrate with R Markdown
Differences:
- targets: Focuses on caching intermediate results
- projr: Focuses on versioning and archiving final outputs
Use together? Yes! Use targets for complex pipelines, projr for versioning and sharing.
workflowr: Website-based project template
Similarities:
- Both provide project structure
- Both integrate with Git
Differences:
- workflowr: Focuses on website generation
- projr: Focuses on versioning and archiving
Use together? Potentially, though there’s overlap in Git integration.
usethis: Package development infrastructure
Similarities:
- Both automate setup tasks
- Both follow conventions
Differences:
- usethis: For R packages
- projr: For research projects
Use together? Yes! Use usethis for package development, projr for analysis projects.
projr’s design prioritises:
These principles guide every design decision, from directory structure to function naming to configuration format.
The result is a tool that makes reproducible research easier than non-reproducible research, which is exactly the point.