Entity Model
This document defines the data entities in codio: what exists in the current implementation, what is proposed to fill gaps identified during M1 review, and how the entities relate to each other.
Throughout, implemented means the entity exists as a Pydantic model or
dataclass in src/codio/ today. Proposed means the entity addresses a
concrete gap but has no model definition yet.
1. Current Entities
1.1 LibraryCatalogEntry
File: src/codio/models.py
Storage: .codio/catalog.yml under the libraries: key
Purpose: Shared, project-agnostic identity metadata for a code source.
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Slug key; primary identifier |
| `kind` | `Kind` | required | internal, external_mirror, utility |
| `language` | `str` | `""` | Dominant language |
| `repo_url` | `str` | `""` | Upstream repository URL |
| `pip_name` | `str` | `""` | Package manager name |
| `license` | `str` | `""` | Software license |
| `path` | `str` | `""` | Local path for internal code or mirrors |
| `summary` | `str` | `""` | Short description |
Identity: The name field is the primary key. It is a human-authored slug
(e.g. scipy, internal-utils, pandas-mirror). All other fields are
descriptive metadata.
Known ambiguities:
- `path` conflates package root, repository root, and arbitrary folder. There is no schema-level distinction.
- `repo_url` is informational. It is not used for cloning, syncing, or deduplication.
- There is no field for repository identity separate from library identity. A single repository may contain multiple libraries (monorepo), and a single library may be a subset of a repository (single package within a larger project).
1.2 ProjectProfileEntry
File: src/codio/models.py
Storage: .codio/profiles.yml under the profiles: key
Purpose: Project-local interpretation and policy for a cataloged library.
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Must match a catalog key |
| `priority` | `Priority` | `tier2` | tier1, tier2, tier3 |
| `runtime_import` | `RuntimeImport` | `reference_only` | internal, pip_only, reference_only |
| `decision_default` | `DecisionDefault` | `new` | existing, wrap, direct, new |
| `capabilities` | `list[str]` | `[]` | Free-form capability tags |
| `curated_note` | `str` | `""` | Path to a curated note .md file |
| `status` | `Status` | `active` | active, candidate, deprecated |
| `notes` | `str` | `""` | Short local comment |
Constraint: name must reference an existing catalog entry. The registry
validator enforces this (profile without catalog entry is an error).
1.3 LibraryRecord
File: src/codio/models.py
Purpose: Merged read-only view combining catalog and profile for a single
library. Not persisted; computed at query time by Registry._merge().
Contains all fields from both LibraryCatalogEntry and ProjectProfileEntry,
plus:
| Field | Type | Description |
|---|---|---|
| `has_profile` | `bool` | Whether a project profile was found for merging |
Constructed via LibraryRecord.from_entries(catalog, profile).
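
The merge can be pictured with a minimal sketch. It uses plain dataclasses with a reduced field set rather than the real models in src/codio/models.py, and the body of `from_entries` is an assumption about how the section 1.2 defaults apply when no profile exists; the actual implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:  # reduced stand-in for LibraryCatalogEntry
    name: str
    kind: str
    language: str = ""


@dataclass(frozen=True)
class ProfileEntry:  # reduced stand-in for ProjectProfileEntry
    name: str
    priority: str = "tier2"
    status: str = "active"


@dataclass(frozen=True)
class LibraryRecord:
    name: str
    kind: str
    language: str
    priority: str
    status: str
    has_profile: bool

    @classmethod
    def from_entries(
        cls, catalog: CatalogEntry, profile: Optional[ProfileEntry]
    ) -> "LibraryRecord":
        # Assumption: without a profile, the profile defaults are used.
        effective = profile if profile is not None else ProfileEntry(name=catalog.name)
        return cls(
            name=catalog.name,
            kind=catalog.kind,
            language=catalog.language,
            priority=effective.priority,
            status=effective.status,
            has_profile=profile is not None,
        )


scipy = CatalogEntry(name="scipy", kind="external_mirror", language="python")
with_profile = LibraryRecord.from_entries(scipy, ProfileEntry(name="scipy", priority="tier1"))
without_profile = LibraryRecord.from_entries(scipy, None)
```

In this sketch `has_profile` is the only way to tell a record built from defaults apart from one whose profile happens to use the default values.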
1.4 RegistrySnapshot
File: src/codio/models.py
Purpose: Serializable payload containing the full registry state.
| Field | Type | Description |
|---|---|---|
| `libraries` | `dict[str, LibraryCatalogEntry]` | Catalog entries keyed by name |
| `profiles` | `dict[str, ProjectProfileEntry]` | Profile entries keyed by name |
| `version` | `str` | Registry schema version (0.1.0) |
Returned by Registry.snapshot() and used by MCP tools (codio_registry).
1.5 ValidationResult
File: src/codio/models.py
Purpose: Output of Registry.validate().
| Field | Type | Description |
|---|---|---|
| `valid` | `bool` | Pass/fail |
| `errors` | `list[str]` | Blocking problems |
| `warnings` | `list[str]` | Non-blocking advisories |
1.6 CodioConfig
File: src/codio/config.py
Purpose: Runtime configuration resolved from .projio/config.yml.
| Field | Type | Default |
|---|---|---|
| `catalog_path` | `Path` | `.codio/catalog.yml` |
| `profiles_path` | `Path` | `.codio/profiles.yml` |
| `notes_dir` | `Path` | `docs/reference/codelib/libraries/` |
| `project_root` | `Path` | current working directory |
1.7 CodioRagSyncResult
File: src/codio/rag.py
Purpose: Outcome of registering codio sources in indexio.
| Field | Type | Description |
|---|---|---|
| `config_path` | `Path` | Path to indexio config file |
| `created` | `bool` | Whether the config was created |
| `initialized` | `bool` | Whether indexio was initialized |
| `added` | `tuple[str, ...]` | Source IDs added |
| `updated` | `tuple[str, ...]` | Source IDs updated |
| `removed` | `tuple[str, ...]` | Source IDs removed |
1.8 Controlled Vocabulary Enums
File: src/codio/vocab.py
All enums inherit from StrEnum and carry a description property.
| Enum | Values | Used by |
|---|---|---|
| `Kind` | `internal`, `external_mirror`, `utility` | Catalog kind |
| `RuntimeImport` | `internal`, `pip_only`, `reference_only` | Profile |
| `DecisionDefault` | `existing`, `wrap`, `direct`, `new` | Profile |
| `Priority` | `tier1`, `tier2`, `tier3` | Profile |
| `Status` | `active`, `candidate`, `deprecated` | Profile |
2. Proposed Entities
The following entities address gaps identified during the M1 review. None of these exist in code today.
2.1 Repository
A first-class entity representing a version-controlled repository, distinct from any library it contains.
| Field | Type | Description |
|---|---|---|
| `repo_id` | `str` | Canonical slug (e.g. scipy/scipy, internal/utils) |
| `url` | `str` | Clone URL (HTTPS or SSH) |
| `hosting` | `str` | github, gitlab, local, other |
| `storage` | `str` | managed, attached, external |
| `local_path` | `str` | Filesystem path when cloned locally |
| `default_branch` | `str` | e.g. main, master |
Rationale: The current model conflates repository identity with library identity. A repository may contain multiple importable libraries (monorepo), or a library may be a subset of a repository (single package within a larger project). Making repository a first-class entity allows:
- Deduplication: two catalog entries pointing to the same repo share one `repo_id`.
- Sync policy: managed mirrors have clone/pull semantics tied to the repo, not the library.
- Provenance: recording how a code source entered codio requires knowing where it came from at the repository level.
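
The deduplication point can be illustrated with a small sketch. Nothing here exists in codio today: the `Repository` dataclass carries only a subset of the proposed fields, the defaults are assumptions, and `RepositoryTable` is a hypothetical in-memory store keyed by `repo_id`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repository:
    repo_id: str               # canonical slug, e.g. "scipy/scipy"
    url: str = ""              # assumed default
    storage: str = "external"  # managed, attached, external (assumed default)


class RepositoryTable:
    """Hypothetical in-memory store keyed by repo_id."""

    def __init__(self) -> None:
        self._repos: dict[str, Repository] = {}

    def add(self, repo: Repository) -> Repository:
        # Keep the first record for a repo_id; later adds reuse it.
        return self._repos.setdefault(repo.repo_id, repo)

    def __len__(self) -> int:
        return len(self._repos)


repos = RepositoryTable()
first = repos.add(
    Repository("scipy/scipy", url="https://github.com/scipy/scipy.git", storage="managed")
)
second = repos.add(Repository("scipy/scipy"))

# Two hypothetical catalog entries share the one repository record.
catalog_repo_ids = {"scipy-linalg": first.repo_id, "scipy-signal": second.repo_id}
```

Keeping the first record on a duplicate `repo_id` is one possible policy; rejecting the second add with an error would be equally defensible.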
2.2 CodeSource
A unit of code within a repository that codio tracks for intelligence purposes.
This replaces the ambiguous path field on LibraryCatalogEntry.
| Field | Type | Description |
|---|---|---|
| `source_id` | `str` | Unique slug within the registry |
| `repo_id` | `str` | FK to Repository |
| `subpath` | `str` | Path within the repository (e.g. src/scipy/linalg) |
| `source_type` | `str` | package, module, script, notebook, config |
| `indexable` | `bool` | Whether this source should be sent to indexio |
Rationale: The current path field on catalog entries is a flat string
with no semantic type. CodeSource introduces a structured pointer: a named
sub-tree within a known repository. This makes it possible to:
- Register multiple indexable units from the same repository.
- Distinguish between a package root and an examples directory.
- Drive `codio rag sync` from explicit source definitions rather than convention.
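
As a sketch of what a structured pointer buys, the following resolves a code source to a directory by joining the repository's local path with the subpath. The `CodeSource` dataclass follows the proposed table, but its defaults, the `resolve_root` helper, and the example IDs and paths are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class CodeSource:
    source_id: str
    repo_id: str
    subpath: str
    source_type: str = "package"  # assumed default
    indexable: bool = True        # assumed default


def resolve_root(source: CodeSource, repo_paths: dict[str, str]) -> Optional[PurePosixPath]:
    """Return the source's directory, or None when its repository has no local clone."""
    local_path = repo_paths.get(source.repo_id)
    if not local_path:
        return None
    return PurePosixPath(local_path) / source.subpath


# repo_id -> local clone path (only repositories with a clone appear here)
repo_paths = {"scipy/scipy": ".codio/mirrors/scipy/scipy"}
sources = [
    CodeSource("scipy-linalg", "scipy/scipy", "src/scipy/linalg"),
    CodeSource("scipy-examples", "scipy/scipy", "examples", source_type="script", indexable=False),
    CodeSource("remote-only", "other/repo", "pkg"),
]
roots = {s.source_id: resolve_root(s, repo_paths) for s in sources}
```

The first two sources show one repository contributing both a package root and an examples directory, with only the former marked indexable.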
2.3 IndexSource
An indexio registration record owned by codio. Partially implemented today
via rag.py constants (codio-notes, codio-catalog), but not modeled as a
data entity.
| Field | Type | Description |
|---|---|---|
| `source_id` | `str` | Indexio source identifier |
| `corpus` | `str` | Indexio corpus name (e.g. codelib) |
| `origin` | `str` | What this indexes: notes, catalog, code |
| `glob` | `str` | File glob for multi-file sources |
| `path` | `str` | Single file path for single-file sources |
Rationale: Currently rag.py hard-codes two source definitions. As codio
tracks more code sources (per 2.2), the set of indexio registrations should be
derived from the entity model, not maintained as constants.
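
One way to picture that derivation is sketched below. The `IndexSource` dataclass mirrors the proposed table; everything else is an assumption made for illustration: the glob and path given to the two built-in sources, the `codio-code-` ID prefix, the `**/*.py` pattern, and the `derive_index_sources` helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSource:
    source_id: str
    corpus: str
    origin: str     # notes, catalog, code
    glob: str = ""  # set for multi-file sources
    path: str = ""  # set for single-file sources


# Stand-ins for the two definitions rag.py hard-codes today; glob and path are assumed.
BUILTIN = (
    IndexSource("codio-notes", "codelib", "notes", glob="docs/reference/codelib/libraries/*.md"),
    IndexSource("codio-catalog", "codelib", "catalog", path=".codio/catalog.yml"),
)


def derive_index_sources(code_sources, corpus="codelib"):
    """code_sources: iterable of (source_id, subpath, indexable) tuples."""
    derived = [
        IndexSource(f"codio-code-{source_id}", corpus, "code", glob=f"{subpath}/**/*.py")
        for source_id, subpath, indexable in code_sources
        if indexable  # non-indexable sources never reach indexio
    ]
    return list(BUILTIN) + derived


registrations = derive_index_sources([
    ("scipy-linalg", "src/scipy/linalg", True),
    ("scipy-examples", "examples", False),
])
```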
2.4 Provenance (metadata, not a standalone entity)
A set of fields recording how and when a catalog entry was added. Could be
attached to LibraryCatalogEntry or Repository rather than modeled as a
separate entity.
| Field | Type | Description |
|---|---|---|
| `added_by` | `str` | manual, discovery, import |
| `added_date` | `str` | ISO date |
| `source_ref` | `str` | Discovery session ID, import file, or empty |
3. Canonical Identifiers
The question of which identifiers are primary keys versus descriptive metadata is central to the entity model.
Current state
| Entity | Primary key | Notes |
|---|---|---|
| `LibraryCatalogEntry` | `name` | Human-authored slug, globally unique within registry |
| `ProjectProfileEntry` | `name` | FK to catalog name |
| `LibraryRecord` | `name` | Inherited from catalog |
All other identifiers (repo_url, pip_name, path) are metadata. None are
used for joins, lookups, or deduplication.
Proposed state
| Entity | Primary key | Notes |
|---|---|---|
| `Repository` | `repo_id` | Canonical slug; used for deduplication |
| `CodeSource` | `source_id` | Unique within registry; references repo_id |
| `IndexSource` | `source_id` | Matches indexio's source identifier |
| Catalog | `name` | Unchanged; gains optional FK to repo_id |
Should repo_id become the primary key for catalog entries? No. A library
and a repository are different things. The library name remains the primary
key in the catalog. The repo_id becomes an optional foreign key linking a
library to its source repository. Libraries without a repository (e.g.
reference-only entries, conceptual groupings) remain valid.
repo_url and pip_name remain metadata. They are useful for display and
for humans, but they are not stable identifiers. URLs change when repositories
move. Package names can differ from library names (e.g. scikit-learn vs
sklearn). The repo_id slug is the stable canonical reference.
4. Entity Relationships
Repository (repo_id)
|
+--< CodeSource (source_id, repo_id)
| |
| +--- indexed by --> IndexSource (source_id)
|
+--< LibraryCatalogEntry (name, repo_id?)
|
+--< ProjectProfileEntry (name)
|
+--- merged into --> LibraryRecord (name)
|
+--- curated_note --> .md file on disk
Key relationships:
- Catalog to Profile: one-to-one. A profile's `name` must match a catalog entry. A catalog entry may exist without a profile (warned, not an error).
- Catalog to Repository (proposed): many-to-one. Multiple catalog entries may reference the same `repo_id` (e.g. different packages in a monorepo). The FK is optional; entries without a repository are valid.
- Repository to CodeSource (proposed): one-to-many. A repository contains one or more trackable code sources.
- CodeSource to IndexSource (proposed): one-to-one or one-to-zero. A code source may or may not be registered in indexio.
- LibraryRecord: derived, not stored. Always computed from catalog + profile at query time.
5. Ownership Rules
Codio distinguishes between data it owns (writes, validates, can modify) and data it references (reads, links to, does not modify).
Owned by codio
| Artifact | Location | Writable |
|---|---|---|
| Library catalog | `.codio/catalog.yml` | Yes |
| Project profiles | `.codio/profiles.yml` | Yes |
| Curated notes | `docs/reference/codelib/libraries/*.md` | Yes |
| Indexio source registrations | `infra/indexio/config.yaml` (codio-owned IDs only) | Yes |
Referenced by codio
| Artifact | Location | Access |
|---|---|---|
| Project config | `.projio/config.yml` | Read |
| Source code (internal) | Varies per `path` field | Read |
| Upstream repositories | Remote URLs per `repo_url` | Read |
| Indexio query results | Via indexio API | Read |
Proposed ownership for new entities
- Repository metadata: Owned by codio. Stored in a new registry file (e.g. `.codio/repos.yml`) or as a section within `catalog.yml`.
- CodeSource definitions: Owned by codio. Derived from repository + catalog entries, or explicitly declared.
- IndexSource registrations: Owned by codio within the indexio config. The `sync_codio_rag_sources` function already uses an owned-source-ID pattern to avoid touching other tools' registrations.
- Managed repository clones: Filesystem artifacts owned by the target project (not by codio's registry). Codio records their location and sync policy but does not own the git state.
Managed vs Attached Repositories
A repository uses one of three storage modes. The first two have a local clone; the third does not:
- Managed: Codio cloned this repository and is responsible for keeping it updated. The `storage` field is `managed`. Sync commands (`codio sync`) pull upstream changes. The local path is deterministic (e.g. `.codio/mirrors/<repo_id>/`).
- Attached: The repository already exists on the filesystem (e.g. a sibling project, a git submodule, a manually cloned directory). Codio records the path but does not clone or pull. The `storage` field is `attached`.
- External: No local clone. Codio has metadata only. The `storage` field is `external`.
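
The practical difference between the three storage values can be sketched as two small functions. This is an illustration under stated assumptions, not existing codio code: the `Storage` enum, `local_path_for`, and `should_sync` are hypothetical names, and the `.codio/mirrors/<repo_id>` layout is taken from the example path above.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class Storage(str, Enum):
    MANAGED = "managed"
    ATTACHED = "attached"
    EXTERNAL = "external"


def local_path_for(repo_id: str, storage: Storage, recorded_path: str = "") -> Optional[PurePosixPath]:
    """Where the repository lives on disk, or None when there is no local clone."""
    if storage is Storage.MANAGED:
        # Deterministic location derived from the repo_id.
        return PurePosixPath(".codio/mirrors") / repo_id
    if storage is Storage.ATTACHED:
        # Codio only records a path that someone else created.
        return PurePosixPath(recorded_path)
    return None


def should_sync(storage: Storage) -> bool:
    # Only managed mirrors are pulled by a sync command.
    return storage is Storage.MANAGED


managed = local_path_for("scipy/scipy", Storage.MANAGED)
attached = local_path_for("internal/utils", Storage.ATTACHED, "../internal-utils")
external = local_path_for("other/repo", Storage.EXTERNAL)
```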
6. Derived Views
These are not persisted entities. They are computed from the registry at query time.
6.1 LibraryRecord (implemented)
Merge of catalog + profile. Used by codio get and codio list. See
section 1.3.
6.2 Filtered Library List (implemented)
Registry.list() accepts filters: kind, language, capability,
priority, runtime_import. Returns list[LibraryRecord].
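
A minimal sketch of such filtering follows. It is a free function over a reduced `Record` stand-in, not the real `Registry.list()`; combining the filters with AND and treating `capability` as a membership test on the capability tags are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:  # reduced stand-in for LibraryRecord
    name: str
    kind: str
    language: str = ""
    priority: str = "tier2"
    runtime_import: str = "reference_only"
    capabilities: tuple[str, ...] = ()


def list_records(
    records: list[Record],
    kind: Optional[str] = None,
    language: Optional[str] = None,
    capability: Optional[str] = None,
    priority: Optional[str] = None,
    runtime_import: Optional[str] = None,
) -> list[Record]:
    """Return records matching every filter that is not None."""

    def matches(r: Record) -> bool:
        return (
            (kind is None or r.kind == kind)
            and (language is None or r.language == language)
            and (capability is None or capability in r.capabilities)
            and (priority is None or r.priority == priority)
            and (runtime_import is None or r.runtime_import == runtime_import)
        )

    return [r for r in records if matches(r)]


records = [
    Record("scipy", "external_mirror", "python", priority="tier1", capabilities=("linalg", "signal")),
    Record("internal-utils", "internal", "python", capabilities=("io",)),
    Record("eigen-mirror", "external_mirror", "cpp", capabilities=("linalg",)),
]
linalg = list_records(records, capability="linalg")
python_mirrors = list_records(records, kind="external_mirror", language="python")
```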
6.3 Discovery Candidates (partially implemented)
codio discover searches for libraries matching a capability query. The
current implementation filters by capability tags on profiles. Proposed
enhancement: also search curated notes and code sources via indexio corpus
queries, returning ranked candidates with evidence snippets.
6.4 Registry Snapshot (implemented)
RegistrySnapshot containing all catalog entries and profiles. Used by the
codio_registry MCP tool for bulk export.
6.5 Validation Report (implemented)
ValidationResult with errors and warnings. Checks: orphan profiles, invalid
vocab values, missing curated note files, catalog entries without profiles.
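
Two of these checks have a severity stated elsewhere in this document: an orphan profile is an error, and a catalog entry without a profile is a warning. The sketch below covers only those two, using a hypothetical `validate` function over sets of names rather than the real `Registry.validate()`; the message wording is invented.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    valid: bool
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def validate(catalog_names: set[str], profile_names: set[str]) -> ValidationResult:
    # A profile whose name has no catalog entry is a blocking problem.
    errors = [
        f"profile '{name}' has no catalog entry"
        for name in sorted(profile_names - catalog_names)
    ]
    # A catalog entry without a profile is only an advisory.
    warnings = [
        f"catalog entry '{name}' has no profile"
        for name in sorted(catalog_names - profile_names)
    ]
    return ValidationResult(valid=not errors, errors=errors, warnings=warnings)


result = validate({"scipy", "numpy"}, {"scipy", "pandas"})
clean = validate({"scipy"}, {"scipy"})
```

In this sketch warnings never affect `valid`; only errors do.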
6.6 Proposed: Repository Summary
Aggregate view grouping all catalog entries by repo_id. Would show which
libraries come from the same repository, their collective status, and sync
state for managed mirrors.
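
A rough sketch of the grouping, assuming catalog entries have already gained the optional `repo_id` FK. The `repository_summary` function and its tuple input are hypothetical, and sync state is left out because no sync model exists yet.

```python
from collections import defaultdict


def repository_summary(entries):
    """Group libraries by repository.

    entries: iterable of (library_name, repo_id or None, status) tuples.
    Entries without a repo_id are skipped, since they belong to no repository.
    """
    summary = defaultdict(lambda: {"libraries": [], "statuses": set()})
    for name, repo_id, status in entries:
        if repo_id is None:
            continue
        summary[repo_id]["libraries"].append(name)
        summary[repo_id]["statuses"].add(status)
    return dict(summary)


summary = repository_summary([
    ("scipy-linalg", "scipy/scipy", "active"),
    ("scipy-signal", "scipy/scipy", "candidate"),
    ("concept-notes", None, "active"),
])
```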
6.7 Proposed: Index Coverage Report
Which code sources are registered in indexio and which are not. Derived from cross-referencing CodeSource entities with IndexSource registrations.
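
The cross-reference reduces to set arithmetic, as the sketch below suggests. How an IndexSource links back to its CodeSource is not specified here, so the hypothetical `index_coverage` function assumes that link has already been resolved to a set of code source IDs.

```python
def index_coverage(code_sources: dict[str, bool], registered: set[str]) -> dict[str, list[str]]:
    """Compare indexable code sources against those registered in indexio.

    code_sources: source_id -> indexable flag.
    registered: code source IDs that currently have an indexio registration.
    """
    indexable = {source_id for source_id, flag in code_sources.items() if flag}
    return {
        "covered": sorted(indexable & registered),     # indexable and registered
        "missing": sorted(indexable - registered),     # indexable but not registered
        "unexpected": sorted(registered - indexable),  # registered but not indexable
    }


report = index_coverage(
    {"scipy-linalg": True, "scipy-signal": True, "scipy-examples": False},
    {"scipy-linalg", "scipy-examples"},
)
```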
7. Open Questions
These items require resolution in later milestones.
- Repository storage location. Should repository metadata live in a separate `.codio/repos.yml` file, or as a new top-level section in `catalog.yml`? A separate file avoids schema migration but adds another file to manage.
- CodeSource granularity. Is the right unit a Python package, a directory tree, a single module, or a file glob? Different use cases (indexing, discovery, import analysis) may need different granularities.
- Slug conventions for `repo_id`. Should it follow GitHub's `owner/repo` convention, or use a flat namespace? Flat slugs are simpler but risk collision. Namespaced slugs require a hosting-provider convention for local-only repositories.
- Provenance tracking scope. Should provenance be recorded for every catalog entry, or only for entries added via automated discovery? Manual entries have implicit provenance (the author), so formal tracking may add overhead without value.
- Sync policy model. Managed mirrors need a sync policy: frequency, branch tracking, conflict resolution. This may warrant its own entity or may be fields on Repository. The design depends on whether codio will invoke git operations directly or delegate to an external tool.
- Multi-project catalog sharing. The current model assumes one catalog per project. If multiple projects share a catalog (e.g. via a git submodule or symlink), the profile layer still varies per project, but the catalog identity semantics change. This affects whether `repo_id` uniqueness is per-catalog or global.
- Capability taxonomy. Capabilities are currently free-form strings on profiles. As discovery improves, a controlled vocabulary or hierarchical taxonomy for capabilities may be needed. This interacts with indexio's corpus structure.
- Version pinning. Neither catalog nor profile tracks which version of a library is in use. For managed mirrors this is implicit (whatever is cloned). For pip-installed libraries, the version comes from `requirements.txt` or `pyproject.toml`, not from codio. Whether codio should record version information or defer to existing package management tools is an open question.