Siblings and RIA stores
Status: draft
Sources & anchors
- Stack component: DataLad
- Canonical artifact: survey component 2 §Convergent patterns RIA layout
- Workshop session: Day-1 AM session 2 (DataLad)
- Outline: _outline.md§B
Frame
A DataLad dataset can push to and pull from multiple named remotes — siblings — each playing a different role: a RIA store for content, GitHub for metadata and discoverability, a GitLab Pages target for publication. This chapter covers how siblings work, how RIA stores are structured, and what projio's helper layer adds.
What a sibling is
A DataLad sibling extends the git concept of a remote to cover annex
content. Where a bare git remote stores commits and trees, a DataLad sibling
stores commits, trees, and annexed file content — or a subset of it. Every
subdataset can have multiple siblings with different purposes. Each sibling has
a name (origin, ria-store, github) and its own push/pull semantics.
Adding a sibling is one command:
datalad siblings add \
    -d . \
    --name ria-store \
    --url "ria+file:///storage2/ria-store" \
    --pushurl "ria+file:///storage2/ria-store"
The --url is used for fetching; --pushurl is used for writing. For a
local RIA store the two are identical. For a remote SSH sibling they may
differ — SSH for writes, HTTPS for reads.
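As a sketch of the split-URL case, the helper below registers a sibling that reads over HTTPS and writes over SSH. The function name add_split_sibling and the example.org URLs are hypothetical; only the datalad siblings add flags are taken from the command above.

```shell
# Register a sibling whose fetch and push URLs differ.
# add_split_sibling is a hypothetical helper name, not part of DataLad.
add_split_sibling() {
    # $1 = sibling name, $2 = fetch URL, $3 = push URL
    datalad siblings add \
        -d . \
        --name "$1" \
        --url "$2" \
        --pushurl "$3"
}

# Example call (placeholder URLs):
# add_split_sibling mirror \
#     "https://example.org/lab/study.git" \
#     "git@example.org:lab/study.git"
```

Afterwards, git remote -v lists the fetch and push URLs on separate lines, which is a quick way to confirm the sibling was registered as intended.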
RIA stores: structure and access
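RIA (Remote Indexed Archive) is DataLad's store format for keeping many
datasets in one location. A RIA store is a directory organized as a two-level
tree keyed by dataset ID (abc/def.../): the first three characters of the ID
name the top-level directory and the remainder names the dataset directory
beneath it. Each dataset directory holds the git repository together with its
annexed content, which is stored by content hash under annex/objects/. The
store supports two protocols: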
- ria+file:///path/to/store: local filesystem access, no network overhead
- ria+ssh://user@host/path/to/store: SSH access to a remote host
Both protocols are transparent to DataLad. The datalad push --to ria-store
command is identical whether the store is local or remote.
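To make the layout concrete, here is a minimal sketch that computes where a dataset would sit inside a store. It assumes the layout documented in the DataLad handbook, in which the first three characters of the dataset ID name the top-level directory. The function name ria_dataset_dir and the dataset ID are made up for illustration.

```shell
# Compute the directory of a dataset inside a RIA store.
# Assumes the layout <store>/<first 3 ID characters>/<rest of ID>.
ria_dataset_dir() (
    # $1 = store root, $2 = dataset ID (a UUID)
    store=${1%/}            # drop a trailing slash, if any
    rest=${2#???}           # the ID without its first three characters
    prefix=${2%"$rest"}     # the first three characters
    printf '%s/%s/%s\n' "$store" "$prefix" "$rest"
)

# The ID of the current dataset can be read with:
#   git config -f .datalad/config datalad.dataset.id
ria_dataset_dir /storage2/ria-store 0f3a9c1e-1111-2222-3333-444455556666
# prints /storage2/ria-store/0f3/a9c1e-1111-2222-3333-444455556666
```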
A RIA store contains aliases — named pointers to datasets inside the store,
recorded in store/alias/. The #~cogpy fragment in a datalad-url resolves
through the alias table: ria+file:///storage/share/git/ria-store#~cogpy finds
the cogpy dataset regardless of where it lives inside the store's hash tree.
Aliases let multiple superdatasets share a stable reference to a library without
coupling to its internal storage layout.
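The alias lookup can be sketched with plain string handling. The helper below splits a URL of the form store-url#~alias and prints the alias entry it refers to. It handles only local ria+file:// stores, and the name ria_alias_path is hypothetical.

```shell
# Map a RIA URL with an alias fragment to the alias entry in the store.
# Sketch for local (ria+file://) stores only.
ria_alias_path() (
    url=$1
    case $url in
        *'#~'*) ;;
        *) echo "no alias fragment in: $url" >&2; exit 1 ;;
    esac
    alias_name=${url##*'#~'}              # text after "#~"
    store_url=${url%%'#~'*}               # text before "#~"
    store_path=${store_url#ria+file://}   # strip the protocol prefix
    printf '%s/alias/%s\n' "${store_path%/}" "$alias_name"
)

ria_alias_path "ria+file:///storage/share/git/ria-store#~cogpy"
# prints /storage/share/git/ria-store/alias/cogpy
```

In a store created with DataLad's tooling, that alias entry is expected to be a symlink to the dataset's directory, so the hashed location never has to appear in a URL.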
The pattern across study projects: a project-local store at
/storage2/ria-store/ holds all study-specific subdatasets for that project;
a shared lab store at /storage/share/git/ria-store/ holds code libraries
that multiple projects consume. gecog's .gitmodules makes this split explicit —
each datalad-url field distinguishes project-local from lab-shared by the
store path.
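A small classifier illustrates how the store path alone tells the two roles apart. The function name store_role is hypothetical, and the two prefixes are the store locations named above.

```shell
# Classify a datalad-url by the store it points to.
store_role() {
    case $1 in
        ria+file:///storage2/ria-store*)          echo project-local ;;
        ria+file:///storage/share/git/ria-store*) echo lab-shared ;;
        *)                                        echo unknown ;;
    esac
}

# The datalad-url fields of a superdataset can be listed with:
#   git config -f .gitmodules --get-regexp '\.datalad-url$'
store_role "ria+file:///storage/share/git/ria-store#~cogpy"
# prints lab-shared
```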
GitHub and GitLab siblings: metadata, not content
GitHub does not support the git-annex protocol and has a 100 MB file-size limit. A GitHub sibling therefore holds metadata only: commit history, directory layout, scripts, configs, YAML, notes. Adding one follows the standard DataLad flow:
# Register the sibling
datalad siblings add --name github --url git@github.com:org/repo.git
# Push metadata only (no annex content)
datalad push --to github --data nothing
A collaborator who clones from GitHub gets the full git history and directory
structure, but every annexed file is a dangling symlink because its content is
not on GitHub. Once the RIA store is registered as a sibling in the clone,
running datalad get raw/sub-01/ retrieves the content from it. The two
siblings are complementary: GitHub for sharing and browsing; RIA store for
data.
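One way to see which files still lack content after such a clone is to look for symlinks whose target is missing. The sketch below assumes the dataset uses git-annex's default symlink mode (not unlocked files or an adjusted branch); list_missing_content is a hypothetical helper name.

```shell
# List annexed files whose content is not present locally.
# In symlink mode, such files are symlinks whose target does not exist.
list_missing_content() {
    # $1 = directory to scan (defaults to the current directory)
    find "${1:-.}" -name .git -prune -o \
        -type l ! -exec test -e {} \; -print
}

# Typical use inside a fresh clone:
#   list_missing_content raw/sub-01
#   datalad get raw/sub-01/
#   list_missing_content raw/sub-01    # expected to print nothing now
```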
GitLab siblings follow the same model. A GitLab Pages target can additionally
serve static site outputs from the docs/ tree when triggered by CI.
The push vs publish asymmetry
- datalad push --to <sibling> sends both git metadata and annex content. Use this for full data sync to a RIA store or any content-capable remote.
- datalad push --to <sibling> --data nothing sends only git metadata: commits, trees, notes, configs. Use this for GitHub/GitLab remotes that should be human-readable mirrors without becoming data stores.
In practice: push --to ria-store nightly or after each pipeline run; push
--to github --data nothing for discoverability and audit. Both are one-command
operations from the superdataset root. The asymmetry is invisible in day-to-day
use once siblings are configured, but understanding it matters when planning
backup and sharing strategy.
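The routine described above can be sketched as a single function run from the superdataset root. The name sync_siblings is hypothetical, and the sibling names ria-store and github are the ones used in this chapter.

```shell
# Push content to the RIA store, then metadata only to GitHub.
# The second push runs only if the first one succeeded.
sync_siblings() {
    datalad push --to ria-store &&
        datalad push --to github --data nothing
}

# Example: call after a pipeline run, from the superdataset root.
#   sync_siblings
```

Pushing to the store first is a design choice in this sketch: the GitHub mirror then never advertises commits whose data has not reached the store.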
Preview-first sibling helpers in projio
projio's src/projio/helpers/ provides thin wrappers for provisioning GitHub,
GitLab, and RIA siblings. All helpers follow a preview-first contract: they
print the command they would run and require --yes to actually execute. This
catches mistakes before a sibling is registered at the wrong path or with the
wrong permissions.
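The preview-first contract is easy to illustrate in shell. The wrapper below is a sketch of the idea only, not projio's helper code; the name preview_first and the message wording are assumptions.

```shell
# Preview-first wrapper: print the command, run it only when --yes is given.
preview_first() {
    if [ "$1" = "--yes" ]; then
        shift
        echo "running: $*"
        "$@"
    else
        echo "would run: $*"
        echo "re-run with --yes to execute"
    fi
}

preview_first echo hello          # prints the preview, runs nothing
preview_first --yes echo hello    # prints "running: echo hello", then "hello"
```

Any command can take the place of echo, for example preview_first datalad siblings add --name github --url git@github.com:org/repo.git.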
The MCP tool datalad_siblings() (exposed as mcp__projio__datalad_siblings)
returns the current sibling list for a dataset. Because DataLad commands must
run in the labpy conda environment (where git-annex lives), projio's tools
wrap the invocation rather than calling DataLad bare. The memory entry
feedback_datalad_conda_wrap.md captures this convention: calling datalad bare,
without conda run, fails silently.
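A minimal sketch of that wrapping convention follows. The function name dl is hypothetical and this is not projio's implementation. It uses conda run -n to execute DataLad inside the labpy environment and reports a missing conda loudly instead of failing silently.

```shell
# Run DataLad inside the labpy conda environment.
dl() {
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found; cannot reach the labpy environment" >&2
        return 127
    fi
    conda run -n labpy datalad "$@"
}

# Example: list the siblings of the current dataset.
#   dl siblings -d .
```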
Permissions sync
permissions_sync() in projio reconciles .claude/settings.json tool
permissions for a project's known siblings and subdatasets. This is separate
from DataLad's own access control: DataLad controls who can push to a sibling;
projio controls which MCP tools and Bash patterns the agent is allowed to
execute. The overlap is in sibling management — adding a new sibling should
also update the agent's allow-list so datalad push --to <new-sibling> is
pre-approved in the project's .claude/settings.json. The honest state:
permissions sync is a helper tool, not a guarantee; the user must review and
commit the resulting settings changes.
Further reading
- DataLad handbook: covers datalad push, sibling setup, SSH and GitHub/GitLab configurations, and RIA store creation and usage.
- git-annex special remotes: the protocol layer underlying DataLad siblings, including the ria+file:// and ria+ssh:// transports.