API

Primary API

provenance([version, repo, name, …]) Decorates a function so that all inputs and outputs are cached.
load_artifact(artifact_id) Loads and returns the Artifact with the artifact_id from the default repo.
load_proxy(artifact_id) Loads and returns the ArtifactProxy with the artifact_id from the default repo.
ensure_proxies(*parameters) Decorator that ensures that the provided parameters are always arguments of type ArtifactProxy.
promote(artifact_or_id, to_repo[, from_repo])
provenance_set([set_labels, initial_set, …])
capture_set([labels, initial_set])
create_set(artifact_ids[, labels])
load_set_by_id(set_id) Loads and returns the ArtifactSet with the set_id from the default repo.
load_set_by_name(set_name) Loads and returns the ArtifactSet with the set_name from the default repo.
archive_file(filename[, name, …]) (beta) Copies or moves the provided filename into the Artifact Repository so it can be used as an ArtifactProxy input to other functions.

Configuration

from_config(config)
load_config(config)
load_yaml_config(filename)
current_config()
get_repo_by_name(repo_name)
set_default_repo(repo_or_name)
get_default_repo()
set_check_mutations(setting)
get_check_mutations()
set_run_info_fn(fn) This hook allows you to provide a function that will be called once with a process’s run_info default dictionary.
get_use_cache()
set_use_cache(setting)
using_repo(repo_or_name)

Utils

is_proxy(obj)
lazy_dict(thunks)
lazy_proxy_dict(artifacts_or_ids[, …]) Takes a list of artifacts or artifact ids and returns a dictionary whose keys are the names of the artifacts.

Visualization

visualize_lineage

Detailed Docs

Primary API

provenance.provenance(version=0, repo=None, name=None, merge_defaults=None, ignore=None, input_hash_fn=None, remove=None, input_process_fn=None, archive_file=False, delete_original_file=False, preserve_file_ext=False, returns_composite=False, custom_fields=None, serializer=None, load_kwargs=None, dump_kwargs=None, use_cache=None, tags=None, _provenance_wrapper=<function provenance_wrapper>)[source]

Decorates a function so that all inputs and outputs are cached. Wraps the return value in a proxy that has an artifact attached to it allowing for the provenance to be tracked.

Parameters:
version : int

Version of the code that is computing the value. You should increment this number whenever something changes that would make a previous version of an artifact outdated. This could be the function itself changing, a change in other functions or libraries that it calls, or updated data in an underlying data source that is being queried.

repo : Repository or str

Which repo this artifact should be saved in. The default repo is used when none is provided and this is the recommended approach. When you pass in a string it should be the name of a repo in the currently registered config.

name : str

The name of the artifact of the function being wrapped. If not provided it defaults to the function name (without the module).

returns_composite : bool

When set to True the function should return a dictionary. Each value of the returned dict will be serialized as an independent artifact. When the composite artifact is returned as a cached value it will be a dict-like object that lazily pulls back the artifacts as requested. Use this when you need multiple artifacts created atomically but do not want to fetch all of them simultaneously; that way you can lazily load only the artifacts you need.
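
As an illustration of the dict-like value a caller receives, here is a hypothetical lazy container (not the library's implementation; all names below are made up) that loads each artifact only on first access:

```python
# Illustrative sketch of a lazily-loading composite: each value is fetched
# only when its key is first accessed, then cached for later lookups.
class LazyComposite:
    def __init__(self, loaders):
        self._loaders = loaders   # name -> zero-arg function that loads the artifact
        self._cache = {}

    def __getitem__(self, name):
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()   # load on first access
        return self._cache[name]

loaded = []   # records which artifacts were actually materialized

composite = LazyComposite({
    'model':   lambda: loaded.append('model') or 'model-object',
    'metrics': lambda: loaded.append('metrics') or {'auc': 0.9},
})

composite['metrics']   # only the 'metrics' artifact is materialized
```

With the real decorator the same access pattern applies to the composite value returned from a cached call: touching one key pulls back only that artifact.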

serializer : str

The name of the serializer you want to use for this artifact. The built-in ones are ‘joblib’ (the default) and ‘cloudpickle’. ‘joblib’ is optimized for numpy while ‘cloudpickle’ can serialize functions and other objects the standard python (and joblib) pickler cannot. You can also register your own serializer via the provenance.register_serializer function.

dump_kwargs : dict

A dict of kwargs to be passed to the serializer when dumping artifacts associated with this function. This is rarely used.

load_kwargs : dict

A dict of kwargs to be passed to the serializer when loading artifacts associated with this function. This is rarely used.

ignore : list, tuple, or set

A list of parameters that should be ignored when computing the input hash. This way you can mark certain parameters as invariant to the computed result. An example would be a parameter indicating how many cores should be used to compute a result: if the result is invariant to the number of cores, you would want to ignore that parameter so the value isn't recomputed when a different number of cores is used.
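
A minimal sketch of how an ignore list affects input hashing (illustrative only; the library's actual hashing scheme may differ):

```python
import hashlib
import pickle

# Sketch: ignored parameters are dropped before hashing, so varying them
# does not change the input hash and the cached result is reused.
def input_hash(kwargs, ignore=()):
    kept = {k: v for k, v in kwargs.items() if k not in ignore}
    return hashlib.sha1(pickle.dumps(sorted(kept.items()))).hexdigest()

h1 = input_hash({'data': [1, 2, 3], 'n_cores': 4}, ignore=('n_cores',))
h2 = input_hash({'data': [1, 2, 3], 'n_cores': 32}, ignore=('n_cores',))
# h1 == h2: the cached result is reused regardless of n_cores
```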

remove : list, tuple, or set

A list of parameters that should be removed prior to hashing and saving of the inputs. The distinction from the ignore parameter is that ignored parameters are still recorded. The motivation to not record, i.e. remove, certain parameters is usually driven by performance or storage considerations.

input_hash_fn : function

A function that takes a dict of all of the arguments' hashes with the structure {'kargs': {'param_a': '1234hash'}, 'varargs': ('deadbeef', …)}. It should return a dict of the same shape, modified as needed. The main use case for this function is overshadowed by the ignore parameter, and so this parameter is hardly ever used.
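
A sketch of such a function, assuming a hypothetical progress_callback parameter whose hash we want to drop from the input hash:

```python
# Sketch of an input_hash_fn: it receives the dict of argument hashes and
# may rewrite it before the final input hash is computed. Here we drop the
# hash of a 'progress_callback' argument (hypothetical parameter name).
def my_input_hash_fn(hashes):
    kargs = {k: v for k, v in hashes['kargs'].items() if k != 'progress_callback'}
    return {'kargs': kargs, 'varargs': hashes['varargs']}

result = my_input_hash_fn({
    'kargs': {'data': '1234hash', 'progress_callback': 'deadbeef'},
    'varargs': (),
})
```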

input_process_fn : function

A function that pre-processes the function's inputs before they are hashed or saved. The function takes a dict of all of the function's arguments with the structure {'kargs': {'param_a': 42}, 'varargs': (100, …)}. It should return a dict of the same shape, modified as needed. The main use case for this function is overshadowed by the remove parameter and the value_repr function.

merge_defaults : bool or list of parameters to be merged

When True, the wrapper introspects the argspec of the function being decorated to see which keyword arguments have default dictionary values. When a list of strings, the list is taken to be the parameters you want to merge on. When a decorated function is called, the dictionary passed in as an argument is merged with the default dictionary, so callers only need to specify the keys they are overriding rather than repeating all the values in the default dictionary.
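
The merge itself amounts to a shallow dict merge over the parameter's default value; a sketch of the semantics (helper name is made up):

```python
# Sketch of merge_defaults behavior: the caller's dict is merged over the
# parameter's default dict, so only overridden keys need to be passed.
def merge_with_default(default, override):
    merged = dict(default)
    merged.update(override or {})
    return merged

DEFAULT_OPTS = {'alpha': 0.1, 'iters': 100, 'verbose': False}

opts = merge_with_default(DEFAULT_OPTS, {'alpha': 0.5})
# opts == {'alpha': 0.5, 'iters': 100, 'verbose': False}
```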

use_cache : bool or None (default None)

Setting use_cache to False turns off the caching effects of the provenance decorator while still tracking the provenance of artifacts. This should only be used during quick local iterations of a function, to avoid having to bump the version with each change. When set to None (the default) it defers to the global provenance use_cache setting.

custom_fields : dict

A dict with types that serialize to json. These are saved for searching in the repository.

tags : list, tuple or set

Will be added to custom_fields as the value for the ‘tags’ key.

archive_file : bool, default False

When True, the return value of the wrapped function is assumed to be a str or pathlike representing a file that should be archived into the blobstore. This is a good option when the result of a function can't easily be returned as an in-memory pickle-able python value.

delete_original_file : bool, default False

To be used in conjunction with archive_file=True: when delete_original_file is True, the returned file will be deleted after it has been archived.

preserve_file_ext : bool, default False

To be used in conjunction with archive_file=True: when preserve_file_ext is True, the id of the archived artifact will be the hash of the file contents plus the file extension of the original file. Set this to True if you want to browse the contents of a blobstore on disk and preview artifacts with your regular OS tools (e.g. viewing images or videos).

Returns:
ArtifactProxy

Returns the value of the decorated function as a proxy. The proxy will act exactly like the original object/value but will have an artifact method that returns the Artifact associated with the value. This wrapped value should be used with all other functions that are wrapped with the provenance decorator as it will help track the provenance and also reduce redundant storage of a given value.
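
To make the caching behavior concrete, here is a minimal pure-Python sketch of what the decorator does conceptually: hash the inputs together with the version, and return a stored result on a cache hit. This is illustrative only; the real decorator additionally persists results to an artifact repository and wraps them in an ArtifactProxy, and none of the names below come from the library.

```python
import hashlib
import pickle

_store = {}   # stand-in for the artifact repository

def provenance_sketch(version=0):
    def decorate(fn):
        def wrapper(*args, **kwargs):
            # The cache key covers the function, its version, and its inputs,
            # so bumping `version` invalidates previously cached results.
            key = hashlib.sha1(
                pickle.dumps((fn.__name__, version, args, sorted(kwargs.items())))
            ).hexdigest()
            if key not in _store:          # cache miss: compute and store
                _store[key] = fn(*args, **kwargs)
            return _store[key]             # cache hit: return stored value
        return wrapper
    return decorate

calls = []

@provenance_sketch(version=0)
def expensive_add(a, b):
    calls.append((a, b))
    return a + b

expensive_add(1, 2)
expensive_add(1, 2)   # second call is served from the cache
```

With the library itself, the equivalent would be decorating expensive_add with @provenance.provenance(version=0) after configuring a default repo.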

provenance.load_artifact(artifact_id)[source]

Loads and returns the Artifact with the artifact_id from the default repo.

Parameters:
artifact_id : string

See also

load_proxy

provenance.load_proxy(artifact_id)[source]

Loads and returns the ArtifactProxy with the artifact_id from the default repo.

Parameters:
artifact_id : string

See also

load_artifact

provenance.ensure_proxies(*parameters)[source]

Decorator that ensures that the provided parameters are always arguments of type ArtifactProxy.

When no parameters are passed then all arguments will be checked.

This is useful to use on functions where you want to make sure artifacts are being passed in so lineage can be tracked.
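
A sketch of the kind of check this decorator performs, using a marker attribute as a stand-in for a real ArtifactProxy (all names here are hypothetical, not the library's implementation):

```python
import functools

# Sketch: each named argument must already carry an artifact (approximated
# here by an `artifact` attribute), otherwise a ValueError is raised so
# broken lineage is caught early.
def ensure_proxies_sketch(*parameters):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            for name in (parameters or kwargs):
                if not getattr(kwargs[name], 'artifact', None):
                    raise ValueError('%s must be an ArtifactProxy' % name)
            return fn(**kwargs)
        return wrapper
    return decorate

class FakeProxy:
    artifact = 'artifact-record'   # stand-in for a real Artifact
    value = 42

@ensure_proxies_sketch('data')
def train(data):
    return data.value
```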

provenance.promote(artifact_or_id, to_repo, from_repo=None)[source]
provenance.provenance_set(set_labels=None, initial_set=None, set_labels_fn=None)[source]
provenance.capture_set(labels=None, initial_set=None)[source]
provenance.create_set(artifact_ids, labels=None)[source]
provenance.load_set_by_id(set_id)[source]

Loads and returns the ArtifactSet with the set_id from the default repo.

Parameters:
set_id : string

See also

load_set_by_name

provenance.load_set_by_name(set_name)[source]

Loads and returns the ArtifactSet with the set_name from the default repo.

Parameters:
set_name : string

See also

load_set_by_id, load_set_by_labels

provenance.archive_file(filename, name=None, delete_original=False, custom_fields=None, preserve_ext=False)[source]

(beta) Copies or moves the provided filename into the Artifact Repository so it can be used as an ArtifactProxy input to other functions.

Parameters:
filename : str or pathlike

The file to be copied or moved into the blobstore. This is a good option to use when the output of a computation can't easily be represented as an in-memory pickle-able python value.

delete_original : bool, default False

When True, the original file will be deleted after it has been archived.

preserve_ext : bool, default False

When True, the id of the archived artifact will be the hash of the file contents plus the file extension of the original file. Set this to True if you want to browse the contents of a blobstore on disk and preview artifacts with your regular OS tools (e.g. viewing images or videos).
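
A sketch of how extension preservation changes the archived id (the library's actual hash function and id format may differ; this is illustrative):

```python
import hashlib
import os

# Sketch: the artifact id is the content hash, with the original file's
# extension appended when preserve_ext is True, so blobs on disk stay
# previewable with normal OS tools.
def archived_id(path, contents, preserve_ext=False):
    digest = hashlib.sha1(contents).hexdigest()
    if preserve_ext:
        _, ext = os.path.splitext(path)
        return digest + ext
    return digest

aid = archived_id('plots/roc_curve.png', b'\x89PNG...', preserve_ext=True)
# aid ends with '.png'
```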

Configuration

provenance.from_config(config)[source]
provenance.load_config(config)[source]
provenance.load_yaml_config(filename)[source]
provenance.current_config()[source]
provenance.get_repo_by_name(repo_name)[source]
provenance.set_default_repo(repo_or_name)[source]
provenance.get_default_repo()[source]
provenance.set_check_mutations(setting)[source]
provenance.get_check_mutations()[source]
provenance.set_run_info_fn(fn)[source]

This hook allows you to provide a function that will be called once with a process’s run_info default dictionary. The provided function can then update this dictionary with other useful information you wish to track, such as git ref or build server id.
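
A sketch of such a hook; the keys added below are examples, not a fixed schema. With the library you would pass the function to provenance.set_run_info_fn.

```python
# Sketch of a run_info hook: it is called once with the process's run_info
# dict and can enrich it with extra metadata to track, e.g. a git ref or
# build server id (hardcoded here purely for illustration).
def add_build_info(run_info):
    run_info['git_ref'] = 'abc1234'            # e.g. from `git rev-parse HEAD`
    run_info['build_server_id'] = 'ci-worker-7'
    return run_info

info = add_build_info({'host': 'laptop', 'process_id': 1234})
```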

provenance.get_use_cache()[source]
provenance.set_use_cache(setting)[source]
provenance.using_repo(repo_or_name)[source]

Utils

provenance.is_proxy(obj)[source]
provenance.lazy_dict(thunks)[source]
provenance.lazy_proxy_dict(artifacts_or_ids, group_artifacts_of_same_name=False)[source]

Takes a list of artifacts or artifact ids and returns a dictionary whose keys are the names of the artifacts. The values will be lazily loaded into proxies as requested.

Parameters:
artifacts_or_ids : collection of artifacts or artifact ids (strings)
group_artifacts_of_same_name : bool (default: False)

If set to True then artifacts of the same name will be grouped together in one list. When set to False an exception will be raised if multiple artifacts share the same name.
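
A sketch of the keying and grouping semantics (an illustrative stand-in for the real implementation, which returns lazily loaded proxies rather than plain values):

```python
# Sketch: artifacts are keyed by name; duplicates are collected into a list
# when group_artifacts_of_same_name=True, otherwise they are an error.
def key_by_name(artifacts, group_artifacts_of_same_name=False):
    out = {}
    for name, value in artifacts:
        if name in out:
            if not group_artifacts_of_same_name:
                raise ValueError('multiple artifacts named %r' % name)
            if not isinstance(out[name], list):
                out[name] = [out[name]]
            out[name].append(value)
        else:
            out[name] = value
    return out

arts = [('model', 'run-1'), ('model', 'run-2'), ('metrics', {'auc': 0.9})]
grouped = key_by_name(arts, group_artifacts_of_same_name=True)
# grouped['model'] == ['run-1', 'run-2']
```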

Visualization (beta)