---
title: "Design"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Design}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
# Design philosophy
This article explains the design decisions behind projr and the reasoning behind each one.

---
## Design goals
### 1. Minimal cognitive overhead
**Goal**: Reduce the mental burden of maintaining reproducible research.
**How projr achieves this:**
- **One main function to remember**: `projr_build()` runs the whole release build
- **Sensible defaults**: Most projects work out-of-the-box
- **Convention over configuration**: Standard directory structure
- **Gradual complexity**: Start simple, customise as needed
**Anti-pattern**: Complex build systems requiring extensive configuration, multiple commands, and deep understanding of internals.
**Example**: Compare a typical `make`-based workflow:
```bash
# Traditional approach
make clean
make data
make analysis
make figures
make paper
make deploy-to-osf
git add .
git commit -m "Update"
git push
gh release create v0.1.0
# ... upload files manually ...
```
With projr:
```{r eval=FALSE}
projr_build() # That's it
```
### 2. Fail-safe iteration
**Goal**: Make it safe to experiment without losing work.
**How projr achieves this:**
- **Dev builds**: Route outputs to cache, never touch `_output`
- **Manifests**: Always know what inputs created what outputs
- **Git integration**: Automatic commits preserve history
- **Reversible versioning**: Can always access previous versions via Git + archives
**Anti-pattern**: In-place modification of output directories leading to lost results.
**Example**: Without projr, you might:
```{r eval=FALSE}
# Accidentally overwrite yesterday's figures
rmarkdown::render("analysis.Rmd") # Oh no, the new plot is worse!
# Now you've lost the good version
```
With projr:
```{r eval=FALSE}
# Safe iteration
projr_build_dev() # Outputs to _tmp/
# Check the results; if they are bad, edit and run again
# If good:
projr_build() # Now commit to _output
```
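
The routing idea can be sketched in a few lines of base R. This is an illustration only, not projr's implementation: `choose_output_dir()` is a hypothetical helper, and the directory names simply follow the examples in this article.

```r
# Hypothetical helper: decide where a build writes its outputs.
# Dev builds go to a cache directory; only final builds touch _output.
choose_output_dir <- function(dev, root = ".") {
  if (dev) {
    file.path(root, "_tmp")
  } else {
    file.path(root, "_output")
  }
}

# Work in a throwaway directory so nothing real is modified
root <- tempfile("projr-demo-")
dir.create(root)

# Simulate a released output from an earlier final build
final_dir <- choose_output_dir(dev = FALSE, root = root)
dir.create(final_dir)
writeLines("released figure", file.path(final_dir, "figure.txt"))

# A dev build writes elsewhere, so the released file is left alone
dev_dir <- choose_output_dir(dev = TRUE, root = root)
dir.create(dev_dir)
writeLines("experimental figure", file.path(dev_dir, "figure.txt"))

released <- readLines(file.path(final_dir, "figure.txt"))
released
```

Because the two kinds of build never share a target directory, a bad dev build cannot damage a released result.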
### 3. Automation without magic
**Goal**: Automate tedious tasks whilst maintaining transparency.
**How projr achieves this:**
- **Explicit configuration**: `_projr.yml` makes everything visible
- **Predictable behaviour**: Same inputs → same outputs
- **Inspectable artefacts**: Manifests, build logs, Git history
- **No hidden state**: All configuration in version-controlled files
**Anti-pattern**: Build systems with hidden state, implicit dependencies, or configuration scattered across multiple locations.
**Example**: projr's manifest system:
```csv
label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z
```
You can inspect, version-control, and audit this file. Compare this with a system that tracks dependencies in a binary database or an in-memory cache.
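
As a rough illustration of what auditing a plain-text manifest can look like, the sketch below records a hash for one file and later checks the file against it. It makes several assumptions that are not taken from projr itself: it hashes a single file rather than a directory, uses MD5 via `tools::md5sum()`, keeps only a subset of the manifest columns, and writes to a made-up `manifest.csv` in a temporary directory.

```r
# Set up a throwaway project with one input file
root <- tempfile("projr-audit-")
dir.create(file.path(root, "_raw_data"), recursive = TRUE)
data_path <- file.path(root, "_raw_data", "measurements.csv")
writeLines(c("id,value", "1,3.2", "2,4.8"), data_path)

# Record a manifest row (per-file MD5 is an assumption for this sketch)
manifest <- data.frame(
  label = "raw-data",
  path = "_raw_data/measurements.csv",
  hash = unname(tools::md5sum(data_path)),
  version = "v0.1.0",
  stringsAsFactors = FALSE
)
manifest_path <- file.path(root, "manifest.csv")
write.csv(manifest, manifest_path, row.names = FALSE)

# Audit: re-read the manifest and recompute each hash from disk
recorded <- read.csv(manifest_path, colClasses = "character")
current <- unname(tools::md5sum(file.path(root, recorded$path)))
recorded$unchanged <- recorded$hash == current
recorded[, c("label", "path", "unchanged")]
```

Any tool that can read a CSV and compute a hash can run the same check, which is the practical benefit of keeping the manifest in plain text.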
### 4. Reproducibility by default
**Goal**: Make it easier to be reproducible than not.
**How projr achieves this:**
- **Automatic versioning**: Every build is versioned
- **Manifests**: Input-output linkage is automatic
- **renv integration**: Optional package version locking
- **Archiving**: Automatic upload to GitHub/OSF
- **Restoration**: One command to reconstruct
**Anti-pattern**: Reproducibility as an afterthought requiring manual effort.
**Example**: Without thinking about it, projr users get:
```
v0.1.0:
- code: Git SHA abc123
- data: Hash def456
- outputs: Hash ghi789
- packages: renv.lock
- archived: GitHub Release v0.1.0
```
All automatically. To reproduce:
```{r eval=FALSE}
projr_restore_repo("owner/repo")
renv::restore()
projr_build()
```
---
## Core design principles
### Single-purpose directories
**Principle**: Each directory has exactly one purpose.
**Rationale**:
- **Clarity**: No ambiguity about where things go
- **Selective sharing**: Archive only what's needed
- **Automation**: Tools can act on directories knowing their contents
- **Restoration**: Simple mapping from label to path
**Trade-offs**:
- ✅ Structure is immediately obvious
- ✅ Easy to share/archive specific parts
- ❌ More directories than minimal structure
- ❌ Some redundancy (e.g., separate output-figures and output-tables)
**Why this trade-off is worth it**:
The clarity and automation benefits outweigh the slight increase in directory count. Modern file systems handle many directories efficiently.
### Versioned builds, not versioned files
**Principle**: Version the entire project state, not individual files.
**Rationale**:
- **Coherence**: All files at version X are consistent with each other
- **Simplicity**: One version number, not per-file versions
- **Traceability**: Know exactly what produced what
- **Restoration**: Restore entire consistent state
**Trade-offs**:
- ✅ Simpler mental model (one version)
- ✅ Guarantees consistency
- ❌ Version bumps even for small changes
- ❌ Can't mix versions of different components
**Why this trade-off is worth it**:
Scientific outputs depend on multiple inputs. Versioning the whole project ensures you can always reconstruct a consistent state. Per-file versioning leads to a combinatorial explosion of possible states.
### Configuration in YAML, not code
**Principle**: Project structure and build behaviour in `_projr.yml`, not scattered across code.
**Rationale**:
- **Centralised**: One place to understand project configuration
- **Readable**: YAML is human-readable
- **Version-controlled**: Configuration changes are tracked
- **Shareable**: Easy to share configuration across projects
**Trade-offs**:
- ✅ Configuration is explicit and visible
- ✅ Easy to diff and merge configuration changes
- ❌ Less flexible than code-based configuration
- ❌ YAML syntax can be tricky
**Why this trade-off is worth it**:
Most research projects don't need the flexibility of code-based configuration. The benefits of having a single, readable, version-controlled configuration file outweigh the limitations.
### Dev builds vs final builds
**Principle**: Separate safe iteration from committed releases.
**Rationale**:
- **Safety**: Dev builds can't accidentally overwrite released outputs
- **Speed**: Dev builds skip versioning and archiving
- **Clarity**: Explicit distinction between "testing" and "committing"
**Trade-offs**:
- ✅ Safe experimentation
- ✅ Fast feedback loop
- ❌ Two commands to remember (dev vs final)
- ❌ Cache directory can grow large
**Why this trade-off is worth it**:
The safety and speed benefits are critical for iterative research. The cost of remembering two commands is minimal.
### Git integration, not Git dependency
**Principle**: projr works with or without Git, but works better with it.
**Rationale**:
- **Accessibility**: Beginners can use projr without learning Git
- **Power**: Advanced users get automatic Git integration
- **Convenience**: Benefit from Git history without running Git commands yourself
**Trade-offs**:
- ✅ Low barrier to entry
- ✅ Automatic Git for those who want it
- ❌ More complex codebase (supporting both paths)
- ❌ Some features require Git (versioning)
**Why this trade-off is worth it**:
Git is powerful but intimidating. By making it optional, projr reaches more users whilst still offering Git benefits to those who want them.

---
## Architecture
### Layered design
projr is organised into layers:
```
User-facing API (projr_build, projr_init, ...)
↓
Configuration layer (YAML parsing, validation)
↓
Build engine (rendering, versioning, archiving)
↓
Backend services (Git, GitHub, OSF, file system)
```
**Benefits**:
- **Modularity**: Each layer can be tested independently
- **Extensibility**: New backends (e.g., Zenodo) can be added
- **Clarity**: Separation of concerns
### Function naming conventions
projr uses systematic naming:
- `projr_*` - All exported functions
- `.projr_*` - Internal functions (not exported)
- `projr_build*` - Build-related functions
- `projr_init*` - Initialisation functions
- `projr_yml_*` - YAML configuration functions
- `projr_path_*` - Path helper functions
**Benefits**:
- **Discoverability**: Autocomplete groups related functions
- **Clarity**: Function purpose is obvious from name
- **Namespace**: All public functions prefixed to avoid conflicts
### Configuration precedence
projr uses this precedence for configuration:
1. **Environment variables** (highest)
2. **Profile YAML** (`_projr-{profile}.yml`)
3. **Default YAML** (`_projr.yml`)
4. **Built-in defaults** (lowest)
**Example**:
```
# Built-in default
output: _output
# Overridden in _projr.yml
output: _my_output
# Overridden in _projr-dev.yml (if PROJR_PROFILE=dev)
output: _dev_output
# Overridden by environment variable (if set)
PROJR_OUTPUT_DIR=_temp_output
```
Final value: `_temp_output`
**Benefits**:
- **Flexibility**: Different contexts without editing files
- **Explicitness**: Clear hierarchy of precedence
- **Debuggability**: Easy to trace where a setting comes from
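
The precedence order above can be sketched as a layered merge in base R. This is a simplified illustration rather than projr's own resolution code: `resolve_config()` and `env_layer()` are hypothetical helpers, and the settings are plain lists standing in for parsed YAML.

```r
# Hypothetical resolver: each later layer overrides the earlier ones
resolve_config <- function(defaults, project = list(), profile = list(),
                           env = list()) {
  config <- defaults
  for (layer in list(project, profile, env)) {
    config <- utils::modifyList(config, layer)
  }
  config
}

# Hypothetical helper: contribute a value only if the variable is set
env_layer <- function() {
  value <- Sys.getenv("PROJR_OUTPUT_DIR", unset = NA)
  if (is.na(value)) list() else list(output = value)
}

defaults <- list(output = "_output", cache = "_tmp") # built-in defaults
project <- list(output = "_my_output")               # from _projr.yml
profile <- list(output = "_dev_output")              # from _projr-dev.yml

# Without the environment variable, the profile value wins
Sys.unsetenv("PROJR_OUTPUT_DIR")
without_env <- resolve_config(defaults, project, profile, env_layer())$output

# With it set, the environment variable takes precedence
Sys.setenv(PROJR_OUTPUT_DIR = "_temp_output")
with_env <- resolve_config(defaults, project, profile, env_layer())$output
Sys.unsetenv("PROJR_OUTPUT_DIR")

c(without_env = without_env, with_env = with_env)
```

Settings that no higher layer mentions, such as `cache` here, fall through unchanged from the built-in defaults.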
### Manifest format
Manifests use CSV for simplicity and compatibility:
```csv
label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z
```
**Why CSV?**
- **Universal**: Readable by any tool (R, Python, Excel)
- **Simple**: No complex parsing
- **Diff-friendly**: Git can show line-by-line changes
- **Human-readable**: Open in text editor or spreadsheet
**Alternative considered**: JSON
- ✅ More structured
- ❌ Less human-readable
- ❌ Harder to diff
- ❌ Overkill for simple tabular data
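
Because the manifest is plain tabular data, comparing two versions takes only a few lines of base R. The sketch below uses made-up rows and hashes, a subset of the columns shown above, and a hypothetical `read_manifest()` helper; it is not how projr itself compares manifests.

```r
# Hypothetical helper: parse manifest text, keeping every column as character
read_manifest <- function(text) {
  read.csv(text = text, colClasses = "character")
}

# Two illustrative manifests (invented hashes)
v1 <- read_manifest("label,path,hash,version
raw-data,_raw_data,abc123,v0.1.0
output,_output,ghi789,v0.1.0")

v2 <- read_manifest("label,path,hash,version
raw-data,_raw_data,abc123,v0.2.0
output,_output,jkl012,v0.2.0")

# Join on label and path, then keep rows whose hash differs
both <- merge(v1, v2, by = c("label", "path"), suffixes = c("_old", "_new"))
changed <- both[both$hash_old != both$hash_new,
                c("label", "hash_old", "hash_new")]
changed
```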
---
## Design decisions
### Why semantic versioning?
**Decision**: Use x.y.z versioning (major.minor.patch)
**Rationale**:
- **Familiar**: Most developers know SemVer
- **Expressive**: Can communicate scale of changes
- **Tooling**: Many tools understand SemVer
**Alternative considered**: Date-based versioning (2024-01-15)
- ✅ Chronological ordering
- ❌ Doesn't communicate significance of changes
- ❌ Multiple versions per day require disambiguation
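
Base R already understands this kind of version string, which is part of the tooling argument. The sketch below is illustrative only: `bump_version()` is a hypothetical helper, not a projr function, and it expects a plain `x.y.z` string without the leading `v` used in release names.

```r
# Hypothetical helper: increment one component and reset the lower ones
bump_version <- function(version, component = c("patch", "minor", "major")) {
  component <- match.arg(component)
  parts <- unlist(unclass(numeric_version(version)))
  stopifnot(length(parts) == 3)
  idx <- match(component, c("major", "minor", "patch"))
  parts[idx] <- parts[idx] + 1L
  parts[seq_along(parts) > idx] <- 0L
  paste(parts, collapse = ".")
}

bump_version("0.1.3", "patch") # "0.1.4"
bump_version("0.1.3", "minor") # "0.2.0"
bump_version("0.1.3", "major") # "1.0.0"

# Versions compare component by component, not as text
numeric_version("0.10.0") > numeric_version("0.9.0") # TRUE
```

The component that changes tells readers how significant the update is, which a date-based scheme cannot express.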
### Why default to GitHub Releases?
**Decision**: GitHub Releases is the default archive destination
**Rationale**:
- **Ubiquity**: Most R projects already use GitHub
- **Free**: Unlimited public releases, generous private quotas
- **Integrated**: Works with existing Git workflow
- **Accessible**: Web interface for downloads
**Alternative considered**: OSF as primary
- ✅ Designed for research
- ✅ Better for large datasets
- ❌ Separate account/authentication
- ❌ Less familiar to R developers
**Solution**: Support both; default to GitHub for familiarity.
### Why clear _output before builds?
**Decision**: Default to clearing `_output` before final builds
**Rationale**:
- **Correctness**: Ensures outputs match current code
- **No cruft**: Old outputs don't linger
- **Idempotency**: Same code → same outputs
**Alternative considered**: Incremental updates
- ✅ Faster (only update changed files)
- ❌ Risk of stale files
- ❌ Non-deterministic (depends on previous state)
**Solution**: Clear by default; allow override via `PROJR_OUTPUT_CLEAR`.
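
The stale-file risk is easy to demonstrate with base R. In this sketch, `clear_dir()` is a hypothetical helper and the file names are invented; it shows the effect of clearing before a build, not projr's actual clearing code.

```r
# Hypothetical helper: empty a directory by removing and recreating it
clear_dir <- function(path) {
  unlink(path, recursive = TRUE)
  dir.create(path, recursive = TRUE)
  invisible(path)
}

# An earlier build left behind a file that the code no longer produces
out <- file.path(tempfile("projr-clear-"), "_output")
dir.create(out, recursive = TRUE)
writeLines("old", file.path(out, "figure-removed-from-code.txt"))

# Clearing first means the directory holds only what this build writes
clear_dir(out)
writeLines("new", file.path(out, "figure-current.txt"))
list.files(out)
```

Without the `clear_dir()` call, both files would remain, and a reader of `_output` could not tell which one the current code produced.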
### Why route dev builds to cache?
**Decision**: Dev builds write to `_tmp/projr/v/`, not `_output`
**Rationale**:
- **Safety**: Can't accidentally overwrite released outputs
- **Isolation**: Multiple dev builds don't conflict
- **Cleanup**: Cache can be deleted without losing work
**Alternative considered**: Use `_output` with flag to prevent overwrites
- ✅ Simpler mental model (one output location)
- ❌ Risk of accidental overwrites
- ❌ Harder to keep dev and release outputs separate
### Why YAML not TOML/JSON?
**Decision**: Use YAML for configuration
**Rationale**:
- **Familiar**: Most R users know YAML (R Markdown, pkgdown)
- **Readable**: Supports comments, and most strings need no quotes
- **Expressive**: Supports lists, nested structures
**Alternatives considered**:
**TOML**:
- ✅ Simpler syntax
- ❌ Less familiar in R ecosystem
- ❌ Harder to nest deeply
**JSON**:
- ✅ Strict, machine-friendly
- ❌ Less human-readable (quotes, no comments)
- ❌ Harder to hand-edit
---
## Future directions
### Potential enhancements
These are design considerations for future versions:
**1. Incremental builds**
**Idea**: Only rebuild changed documents
**Pros**: Faster builds, less re-rendering
**Cons**: More complexity, risk of stale outputs
**Decision**: Consider for v2.0 with careful invalidation logic
**2. Dependency graphs**
**Idea**: Track, at the level of individual files, which outputs depend on which inputs
**Pros**: Finer-grained rebuilding, better traceability
**Cons**: Complexity, requires analysing code
**Decision**: Interesting, but out of scope for now
**3. Remote execution**
**Idea**: Build on CI/cloud instead of locally
**Pros**: Reproducible environment, faster hardware
**Cons**: Network dependency, setup complexity
**Decision**: Possible via existing CI integrations (GitHub Actions)
**4. Multi-language support**
**Idea**: Support Python, Julia, etc., not just R
**Pros**: Broader audience, more use cases
**Cons**: Different ecosystems, more maintenance
**Decision**: Focus on R first; generalise later if demand exists

---
## Comparison to alternatives
### projr vs targets
**targets**: Pipeline tool for dependency tracking
**Similarities**:
- Both focus on reproducibility
- Both integrate with R Markdown
**Differences**:
- **targets**: Focuses on caching intermediate results
- **projr**: Focuses on versioning and archiving final outputs
**Use together?** Yes! Use targets for complex pipelines, projr for versioning and sharing.
### projr vs workflowr
**workflowr**: Website-based project template
**Similarities**:
- Both provide project structure
- Both integrate with Git
**Differences**:
- **workflowr**: Focuses on website generation
- **projr**: Focuses on versioning and archiving
**Use together?** Potentially, though there's overlap in Git integration.
### projr vs usethis
**usethis**: Package development infrastructure
**Similarities**:
- Both automate setup tasks
- Both follow conventions
**Differences**:
- **usethis**: For R packages
- **projr**: For research projects
**Use together?** Yes! Use usethis for package development, projr for analysis projects.

---
## Conclusion
projr's design prioritises:
1. **Simplicity**: One function, `projr_build()`, runs the whole release build
2. **Safety**: Dev builds can't break releases
3. **Transparency**: Configuration is visible and version-controlled
4. **Reproducibility**: Automatic versioning and archiving
These principles guide every design decision, from directory structure to function naming to configuration format.
The result is a tool that makes reproducible research easier than non-reproducible research, which is exactly the point.