--- title: "Design" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Design} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Design philosophy This article explains the design decisions behind projr and why it works the way it does. --- ## Design goals ### 1. Minimal cognitive overhead **Goal**: Reduce the mental burden of maintaining reproducible research. **How projr achieves this:** - **One function to remember**: `projr_build()` does everything - **Sensible defaults**: Most projects work out-of-the-box - **Convention over configuration**: Standard directory structure - **Gradual complexity**: Start simple, customise as needed **Anti-pattern**: Complex build systems requiring extensive configuration, multiple commands, and deep understanding of internals. **Example**: Compare a typical `make`-based workflow: ```bash # Traditional approach make clean make data make analysis make figures make paper make deploy-to-osf git add . git commit -m "Update" git push gh release create v0.1.0 # ... upload files manually ... ``` With projr: ```{r eval=FALSE} projr_build() # That's it ``` ### 2. Fail-safe iteration **Goal**: Make it safe to experiment without losing work. **How projr achieves this:** - **Dev builds**: Route outputs to cache, never touch `_output` - **Manifests**: Always know what inputs created what outputs - **Git integration**: Automatic commits preserve history - **Reversible versioning**: Can always access previous versions via Git + archives **Anti-pattern**: In-place modification of output directories leading to lost results. **Example**: Without projr, you might: ```{r eval=FALSE} # Accidentally overwrite yesterday's figures render("analysis.Rmd") # Oh no, the new plot is worse! # Now you've lost the good version ``` With projr: ```{r eval=FALSE} # Safe iteration projr_build_dev() # Outputs to _tmp/ # Check results, if bad, just run again # If good: projr_build() # Now commit to _output ``` ### 3. Automation without magic **Goal**: Automate tedious tasks whilst maintaining transparency. **How projr achieves this:** - **Explicit configuration**: `_projr.yml` makes everything visible - **Predictable behaviour**: Same inputs → same outputs - **Inspectable artefacts**: Manifests, build logs, Git history - **No hidden state**: All configuration in version-controlled files **Anti-pattern**: Build systems with hidden state, implicit dependencies, or configuration scattered across multiple locations. **Example**: projr's manifest system: ```csv label,path,hash,version,timestamp raw-data,_raw_data,abc123...,v0.1.0,2024-01-15 ``` You can inspect, version-control, and audit this. Compare to a system that tracks dependencies in a binary database or in-memory cache. ### 4. Reproducibility by default **Goal**: Make it easier to be reproducible than not. **How projr achieves this:** - **Automatic versioning**: Every build is versioned - **Manifests**: Inputs-outputs linkage is automatic - **renv integration**: Optional package version locking - **Archiving**: Automatic upload to GitHub/OSF - **Restoration**: One command to reconstruct **Anti-pattern**: Reproducibility as an afterthought requiring manual effort. **Example**: Without thinking about it, projr users get: ``` v0.1.0: - code: Git SHA abc123 - data: Hash def456 - outputs: Hash ghi789 - packages: renv.lock - archived: GitHub Release v0.1.0 ``` All automatically. To reproduce: ```{r eval=FALSE} projr_restore_repo("owner/repo") renv::restore() projr_build() ``` --- ## Core design principles ### Single-purpose directories **Principle**: Each directory has exactly one purpose. **Rationale**: - **Clarity**: No ambiguity about where things go - **Selective sharing**: Archive only what's needed - **Automation**: Tools can act on directories knowing their contents - **Restoration**: Simple mapping from label to path **Trade-offs**: - ✅ Structure is immediately obvious - ✅ Easy to share/archive specific parts - ❌ More directories than minimal structure - ❌ Some redundancy (e.g., separate output-figures and output-tables) **Why this trade-off is worth it**: The clarity and automation benefits outweigh the slight increase in directory count. Modern file systems handle many directories efficiently. ### Versioned builds, not versioned files **Principle**: Version the entire project state, not individual files. **Rationale**: - **Coherence**: All files at version X are consistent with each other - **Simplicity**: One version number, not per-file versions - **Traceability**: Know exactly what produced what - **Restoration**: Restore entire consistent state **Trade-offs**: - ✅ Simpler mental model (one version) - ✅ Guarantees consistency - ❌ Version bumps even for small changes - ❌ Can't mix versions of different components **Why this trade-off is worth it**: Scientific outputs depend on multiple inputs. Versioning the whole project ensures you can always reconstruct a consistent state. Per-file versioning leads to combinatorial explosion of possible states. ### Configuration in YAML, not code **Principle**: Project structure and build behaviour in `_projr.yml`, not scattered across code. **Rationale**: - **Centralised**: One place to understand project configuration - **Readable**: YAML is human-readable - **Version-controlled**: Configuration changes are tracked - **Shareable**: Easy to share configuration across projects **Trade-offs**: - ✅ Configuration is explicit and visible - ✅ Easy to diff and merge configuration changes - ❌ Less flexible than code-based configuration - ❌ YAML syntax can be tricky **Why this trade-off is worth it**: Most research projects don't need the flexibility of code-based configuration. The benefits of having a single, readable, version-controlled configuration file outweigh the limitations. ### Dev builds vs final builds **Principle**: Separate safe iteration from committed releases. **Rationale**: - **Safety**: Dev builds can't accidentally overwrite released outputs - **Speed**: Dev builds skip versioning and archiving - **Clarity**: Explicit distinction between "testing" and "committing" **Trade-offs**: - ✅ Safe experimentation - ✅ Fast feedback loop - ❌ Two commands to remember (dev vs final) - ❌ Cache directory can grow large **Why this trade-off is worth it**: The safety and speed benefits are critical for iterative research. The cost of remembering two commands is minimal. ### Git integration, not Git dependency **Principle**: projr works with or without Git, but works better with it. **Rationale**: - **Accessibility**: Beginners can use projr without learning Git - **Power**: Advanced users get automatic Git integration - **Flexibility**: Use Git features without learning them **Trade-offs**: - ✅ Low barrier to entry - ✅ Automatic Git for those who want it - ❌ More complex codebase (supporting both paths) - ❌ Some features require Git (versioning) **Why this trade-off is worth it**: Git is powerful but intimidating. By making it optional, projr reaches more users whilst still offering Git benefits to those who want them. --- ## Architecture ### Layered design projr is organised into layers: ``` User-facing API (projr_build, projr_init, ...) ↓ Configuration layer (YAML parsing, validation) ↓ Build engine (rendering, versioning, archiving) ↓ Backend services (Git, GitHub, OSF, file system) ``` **Benefits**: - **Modularity**: Each layer can be tested independently - **Extensibility**: New backends (e.g., Zenodo) can be added - **Clarity**: Separation of concerns ### Function naming conventions projr uses systematic naming: - `projr_*` - All exported functions - `.projr_*` - Internal functions (not exported) - `projr_build*` - Build-related functions - `projr_init*` - Initialisation functions - `projr_yml_*` - YAML configuration functions - `projr_path_*` - Path helper functions **Benefits**: - **Discoverability**: Autocomplete groups related functions - **Clarity**: Function purpose is obvious from name - **Namespace**: All public functions prefixed to avoid conflicts ### Configuration precedence projr uses this precedence for configuration: 1. **Environment variables** (highest) 2. **Profile YAML** (`_projr-{profile}.yml`) 3. **Default YAML** (`_projr.yml`) 4. **Built-in defaults** (lowest) **Example**: ``` # Built-in default output: _output # Overridden in _projr.yml output: _my_output # Overridden in _projr-dev.yml (if PROJR_PROFILE=dev) output: _dev_output # Overridden by environment variable (if set) PROJR_OUTPUT_DIR=_temp_output ``` Final value: `_temp_output` **Benefits**: - **Flexibility**: Different contexts without editing files - **Explicitness**: Clear hierarchy of precedence - **Debuggability**: Easy to trace where a setting comes from ### Manifest format Manifests use CSV for simplicity and compatibility: ```csv label,path,hash,version,timestamp raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z ``` **Why CSV?** - **Universal**: Readable by any tool (R, Python, Excel) - **Simple**: No complex parsing - **Diff-friendly**: Git can show line-by-line changes - **Human-readable**: Open in text editor or spreadsheet **Alternative considered**: JSON - ✅ More structured - ❌ Less human-readable - ❌ Harder to diff - ❌ Overkill for simple tabular data --- ## Design decisions ### Why semantic versioning? **Decision**: Use x.y.z versioning (major.minor.patch) **Rationale**: - **Familiar**: Most developers know SemVer - **Expressive**: Can communicate scale of changes - **Tooling**: Many tools understand SemVer **Alternative considered**: Date-based versioning (2024-01-15) - ✅ Chronological ordering - ❌ Doesn't communicate significance of changes - ❌ Multiple versions per day require disambiguation ### Why default to GitHub Releases? **Decision**: GitHub Releases is the default archive destination **Rationale**: - **Ubiquity**: Most R projects already use GitHub - **Free**: Unlimited public releases, generous private quotas - **Integrated**: Works with existing Git workflow - **Accessible**: Web interface for downloads **Alternative considered**: OSF as primary - ✅ Designed for research - ✅ Better for large datasets - ❌ Separate account/authentication - ❌ Less familiar to R developers **Solution**: Support both; default to GitHub for familiarity. ### Why clear _output before builds? **Decision**: Default to clearing `_output` before final builds **Rationale**: - **Correctness**: Ensures outputs match current code - **No cruft**: Old outputs don't linger - **Idempotency**: Same code → same outputs **Alternative considered**: Incremental updates - ✅ Faster (only update changed files) - ❌ Risk of stale files - ❌ Non-deterministic (depends on previous state) **Solution**: Clear by default; allow override via `PROJR_OUTPUT_CLEAR`. ### Why route dev builds to cache? **Decision**: Dev builds write to `_tmp/projr/v/` not `_output` **Rationale**: - **Safety**: Can't accidentally overwrite released outputs - **Isolation**: Multiple dev builds don't conflict - **Cleanup**: Cache can be deleted without losing work **Alternative considered**: Use `_output` with flag to prevent overwrites - ✅ Simpler mental model (one output location) - ❌ Risk of accidental overwrites - ❌ Harder to keep dev and release outputs separate ### Why YAML not TOML/JSON? **Decision**: Use YAML for configuration **Rationale**: - **Familiar**: Most R users know YAML (R Markdown, pkgdown) - **Readable**: Comments, no quotes on strings - **Expressive**: Supports lists, nested structures **Alternatives considered**: **TOML**: - ✅ Simpler syntax - ❌ Less familiar in R ecosystem - ❌ Harder to nest deeply **JSON**: - ✅ Strict, machine-friendly - ❌ Less human-readable (quotes, no comments) - ❌ Harder to hand-edit --- ## Future directions ### Potential enhancements These are design considerations for future versions: **1. Incremental builds** **Idea**: Only rebuild changed documents **Pros**: Faster builds, less re-rendering **Cons**: More complexity, risk of stale outputs **Decision**: Consider for v2.0 with careful invalidation logic **2. Dependency graphs** **Idea**: Track which outputs depend on which inputs **Pros**: Finer-grained rebuilding, better traceability **Cons**: Complexity, requires analysing code **Decision**: Interesting but out-of-scope for now **3. Remote execution** **Idea**: Build on CI/cloud instead of locally **Pros**: Reproducible environment, faster hardware **Cons**: Network dependency, setup complexity **Decision**: Possible via existing CI integrations (GitHub Actions) **4. Multi-language support** **Idea**: Support Python, Julia, etc., not just R **Pros**: Broader audience, more use cases **Cons**: Different ecosystems, more maintenance **Decision**: Focus on R first; generalise later if demand exists --- ## Comparison to alternatives ### projr vs targets **targets**: Pipeline tool for dependency tracking **Similarities**: - Both focus on reproducibility - Both integrate with R Markdown **Differences**: - **targets**: Focuses on caching intermediate results - **projr**: Focuses on versioning and archiving final outputs **Use together?** Yes! Use targets for complex pipelines, projr for versioning and sharing. ### projr vs workflowr **workflowr**: Website-based project template **Similarities**: - Both provide project structure - Both integrate with Git **Differences**: - **workflowr**: Focuses on website generation - **projr**: Focuses on versioning and archiving **Use together?** Potentially, though there's overlap in Git integration. ### projr vs usethis **usethis**: Package development infrastructure **Similarities**: - Both automate setup tasks - Both follow conventions **Differences**: - **usethis**: For R packages - **projr**: For research projects **Use together?** Yes! Use usethis for package development, projr for analysis projects. --- ## Conclusion projr's design prioritises: 1. **Simplicity**: One function does everything 2. **Safety**: Dev builds can't break releases 3. **Transparency**: Configuration is visible and version-controlled 4. **Reproducibility**: Automatic versioning and archiving These principles guide every design decision, from directory structure to function naming to configuration format. The result is a tool that makes reproducible research easier than non-reproducible research—which is exactly the point.