API

Primary API

provenance([version, repo, name, …])
    Decorates a function so that all inputs and outputs are cached.

load_artifact(artifact_id)
    Loads and returns the Artifact with the artifact_id from the default repo.

load_proxy(artifact_id)
    Loads and returns the ArtifactProxy with the artifact_id from the default repo.

ensure_proxies(*parameters)
    Decorator that ensures that the provided parameters are always arguments of type ArtifactProxy.

promote(artifact_or_id, to_repo[, from_repo])

provenance_set([set_labels, initial_set, …])

capture_set([labels, initial_set])

create_set(artifact_ids[, labels])

load_set_by_id(set_id)
    Loads and returns the ArtifactSet with the set_id from the default repo.

load_set_by_name(set_name)
    Loads and returns the ArtifactSet with the set_name from the default repo.

archive_file(filename[, name, …])
    (beta) Copies or moves the provided filename into the Artifact Repository so it can be used as an ArtifactProxy to inputs of other functions.
Configuration

from_config(config)

load_config(config)

load_yaml_config(filename)

current_config()

get_repo_by_name(repo_name)

set_default_repo(repo_or_name)

get_default_repo()

set_check_mutations(setting)

get_check_mutations()

set_run_info_fn(fn)
    This hook allows you to provide a function that will be called once with a process's run_info default dictionary.

get_use_cache()

set_use_cache(setting)

using_repo(repo_or_name)
Utils

is_proxy(obj)

lazy_dict(thunks)

lazy_proxy_dict(artifacts_or_ids[, …])
    Takes a list of artifacts or artifact ids and returns a dictionary whose keys are the names of the artifacts.

Visualization

visualize_lineage
Detailed Docs

Primary API

provenance.provenance(version=0, repo=None, name=None, merge_defaults=None, ignore=None, input_hash_fn=None, remove=None, input_process_fn=None, archive_file=False, delete_original_file=False, preserve_file_ext=False, returns_composite=False, custom_fields=None, serializer=None, load_kwargs=None, dump_kwargs=None, use_cache=None, tags=None)
    Decorates a function so that all inputs and outputs are cached. Wraps the return value in a proxy that has an artifact attached to it, allowing the provenance to be tracked.
Parameters: - version : int
Version of the code that is computing the value. You should increment this number when anything has changed that would make a previous version of an artifact outdated. This could be the function itself changing, a change in other functions or libraries it calls, or updated data in an underlying data source that is being queried.
- repo : Repository or str
Which repo this artifact should be saved in. The default repo is used when none is provided and this is the recommended approach. When you pass in a string it should be the name of a repo in the currently registered config.
- name : str
The name of the artifact of the function being wrapped. If not provided it defaults to the function name (without the module).
- returns_composite : bool
When set to True the function should return a dictionary. Each value of the returned dict will be serialized as an independent artifact. When the composite artifact is returned as a cached value it will be a dict-like object that lazily pulls back the artifacts as requested. Use this when you need multiple artifacts created atomically but do not want to fetch all of them simultaneously; that way you can lazily load only the artifacts you need.
- serializer : str
The name of the serializer you want to use for this artifact. The built-in ones are 'joblib' (the default) and 'cloudpickle'. 'joblib' is optimized for numpy while 'cloudpickle' can serialize functions and other objects that the standard Python (and joblib) pickler cannot. You can also register your own serializer via the provenance.register_serializer function.
- dump_kwargs : dict
A dict of kwargs to be passed to the serializer when dumping artifacts associated with this function. This is rarely used.
- load_kwargs : dict
A dict of kwargs to be passed to the serializer when loading artifacts associated with this function. This is rarely used.
- ignore : list, tuple, or set
A list of parameters that should be ignored when computing the input hash. This way you can mark certain parameters as invariant to the computed result. An example would be a parameter indicating how many cores should be used to compute a result: if the result is invariant to the number of cores, you would want to ignore it so the value isn't recomputed when a different number of cores is used.
- remove : list, tuple, or set
A list of parameters that should be removed prior to hashing and saving of the inputs. The distinction between this and the ignore parameter is that ignored parameters are still recorded. The motivation to not record, i.e. remove, certain parameters is usually driven by performance or storage considerations.
- input_hash_fn : function
A function that takes a dict of all of the arguments' hashes, with the structure {'kargs': {'param_a': '1234hash'}, 'varargs': ('deadbeef', ...)}. It should return a dict of the same shape but may change it as needed. The main use case for this function is overshadowed by the ignore parameter, so this parameter is hardly ever used.
- input_process_fn : function
A function that pre-processes the function's inputs before they are hashed or saved. It takes a dict of all of the function's arguments, with the structure {'kargs': {'param_a': 42}, 'varargs': (100, ...)}. It should return a dict of the same shape but may change it as needed. The main use case for this function is overshadowed by the remove parameter and the value_repr function.
- merge_defaults : bool or list of parameters to be merged
When True, the wrapper introspects the argspec of the function being decorated to see which keyword arguments have default dictionary values. When a list of strings, the list is taken to be the parameters you want to merge on. When the decorated function is called, the dictionary passed in as an argument is merged with the default dictionary. That way callers only need to specify the keys they are overriding and don't have to repeat all the values in the default dictionary.
- use_cache : bool or None (default None)
Setting use_cache to False turns off the caching effects of the provenance decorator while still tracking the provenance of artifacts. This should only be used during quick local iterations of a function to avoid having to bump the version with each change. When set to None (the default) it defers to the global provenance use_cache setting.
- custom_fields : dict
A dict with types that serialize to json. These are saved for searching in the repository.
- tags : list, tuple or set
Will be added to custom_fields as the value for the ‘tags’ key.
- archive_file : bool, defaults False
When True, the return value of the wrapped function is assumed to be a str or pathlike that represents a file that should be archived into the blobstore. This is a good option when the result of a function cannot easily be returned as an in-memory pickle-able Python value.
- delete_original_file : bool, defaults False
To be used in conjunction with archive_file=True: when delete_original_file is True the returned file will be deleted after it has been archived.
- preserve_file_ext : bool, default False
To be used in conjunction with archive_file=True: when preserve_file_ext is True the id of the archived artifact will be the hash of the file contents plus the file extension of the original file. Set this to True if you want to browse the contents of a blobstore on disk and preview artifacts with your regular OS tools (e.g. viewing images or videos).
Returns: - ArtifactProxy
Returns the value of the decorated function as a proxy. The proxy will act exactly like the original object/value but will have an artifact method that returns the Artifact associated with the value. This wrapped value should be used with all other functions that are wrapped with the provenance decorator as it will help track the provenance and also reduce redundant storage of a given value.
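The caching behavior described above can be sketched in plain Python. The following is a simplified stand-in, not the library's implementation: the real decorator hashes inputs robustly (e.g. for numpy arrays), persists artifacts to a repository, and returns an ArtifactProxy rather than a bare value.

```python
import functools
import hashlib
import pickle


def toy_provenance(version=0):
    """Simplified stand-in for the provenance decorator: cache results
    keyed on a hash of the inputs plus the version."""
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Hash the inputs together with the version; bumping the
            # version invalidates previously cached results.
            key = hashlib.sha256(
                pickle.dumps((version, fn.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            if key not in cache:
                cache[key] = fn(*args, **kwargs)
            return cache[key]

        return wrapper

    return decorator


call_log = []


@toy_provenance(version=0)
def expensive_add(a, b):
    call_log.append((a, b))  # record actual executions to show cache hits
    return a + b
```

Note how bumping version changes every cache key, which is exactly why you increment it when the function's logic or its dependencies change. With the real library you would write @provenance.provenance(version=0) above the function, and the returned value would also carry the associated Artifact.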
provenance.load_artifact(artifact_id)
    Loads and returns the Artifact with the artifact_id from the default repo.
    Parameters: - artifact_id : string
    See also
provenance.load_proxy(artifact_id)
    Loads and returns the ArtifactProxy with the artifact_id from the default repo.
    Parameters: - artifact_id : string
    See also
provenance.ensure_proxies(*parameters)
    Decorator that ensures that the provided parameters are always arguments of type ArtifactProxy.
    When no parameters are passed, all arguments will be checked.
    This is useful on functions where you want to make sure artifacts are being passed in so lineage can be tracked.
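To illustrate the kind of check this decorator performs, here is a simplified stand-in (not the library's code) that assumes, for illustration only, that proxies can be recognized by an artifact attribute:

```python
import functools
import inspect


def toy_ensure_proxies(*parameters):
    """Simplified stand-in for ensure_proxies: require that the named
    arguments (or all arguments, if none are named) look like
    ArtifactProxy objects, recognized here by an `artifact` attribute."""
    def decorator(fn):
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            # With no explicit parameters, check every passed argument.
            to_check = parameters or bound.arguments.keys()
            for name in to_check:
                if name in bound.arguments and not hasattr(bound.arguments[name], 'artifact'):
                    raise ValueError(
                        f"Argument {name!r} must be an ArtifactProxy so lineage can be tracked")
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@toy_ensure_proxies('data')
def summarize(data, verbose=False):
    return sum(data)
```

The real decorator raises its own error type when a plain value is passed; the point here is only that the check happens before the wrapped function runs.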
provenance.load_set_by_id(set_id)
    Loads and returns the ArtifactSet with the set_id from the default repo.
    Parameters: - set_id : string
    See also
provenance.load_set_by_name(set_name)
    Loads and returns the ArtifactSet with the set_name from the default repo.
    Parameters: - set_name : string
    See also load_set_by_id, load_set_by_labels
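To picture how create_set, load_set_by_id, and load_set_by_name relate, here is a minimal in-memory stand-in. It is not the library's implementation: the real ArtifactSet is persisted in the repo, and the assumption here that a label doubles as the lookup name is purely illustrative.

```python
import hashlib

# Stand-in storage for sets; the real library persists these in the repo.
_sets_by_id = {}
_sets_by_name = {}


def toy_create_set(artifact_ids, labels=None):
    """Stand-in for create_set: an immutable collection of artifact ids
    with a content-derived set id, optionally registered under labels."""
    ids = frozenset(artifact_ids)
    set_id = hashlib.sha256(repr(sorted(ids)).encode()).hexdigest()[:12]
    record = {'id': set_id, 'artifact_ids': ids, 'labels': labels or []}
    _sets_by_id[set_id] = record
    for label in record['labels']:
        _sets_by_name[label] = record
    return record


def toy_load_set_by_id(set_id):
    return _sets_by_id[set_id]


def toy_load_set_by_name(set_name):
    return _sets_by_name[set_name]
```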
provenance.archive_file(filename, name=None, delete_original=False, custom_fields=None, preserve_ext=False)
    (beta) Copies or moves the provided filename into the Artifact Repository so it can be used as an ArtifactProxy to inputs of other functions.
    Parameters: - filename : str or pathlike
        The file to be archived into the blobstore.
    - delete_original : bool, defaults False
        When True the original file will be deleted after it has been archived.
    - preserve_ext : bool, default False
        When True the id of the archived artifact will be the hash of the file contents plus the file extension of the original file. Set this to True if you want to browse the contents of a blobstore on disk and preview artifacts with your regular OS tools (e.g. viewing images or videos).
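The content-addressed id and the preserve_ext behavior can be sketched with plain file operations. This is a stand-in, not the library's storage layout:

```python
import hashlib
import os
import shutil


def toy_archive_file(filename, blobstore_dir, delete_original=False, preserve_ext=False):
    """Stand-in for archive_file: the archived blob's id is the hash of
    the file contents, optionally keeping the original file extension so
    OS tools can still preview the blob."""
    with open(filename, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    ext = os.path.splitext(filename)[1] if preserve_ext else ''
    blob_id = digest + ext
    os.makedirs(blobstore_dir, exist_ok=True)
    dest = os.path.join(blobstore_dir, blob_id)
    if delete_original:
        shutil.move(filename, dest)   # moves the file into the store
    else:
        shutil.copy2(filename, dest)  # copies; the original is kept
    return blob_id
```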
Configuration

provenance.set_run_info_fn(fn)
    This hook allows you to provide a function that will be called once with a process's run_info default dictionary. The provided function can then update this dictionary with other useful information you wish to track, such as a git ref or build server id.
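For example, a hook that records CI metadata might look like the following. The environment variable names are illustrative assumptions, and the registration call is shown in a comment because it requires a configured provenance installation:

```python
import os


def add_ci_info(run_info):
    """A run_info hook: receives the process's run_info dict once and
    adds extra fields to be tracked, e.g. a git ref or build id."""
    run_info['git_ref'] = os.environ.get('GIT_COMMIT', 'unknown')
    run_info['build_id'] = os.environ.get('BUILD_ID', 'unknown')
    return run_info


# Registration with the real library:
# import provenance as p
# p.set_run_info_fn(add_ci_info)
```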
Utils

provenance.lazy_proxy_dict(artifacts_or_ids, group_artifacts_of_same_name=False)
    Takes a list of artifacts or artifact ids and returns a dictionary whose keys are the names of the artifacts. The values will be lazily loaded into proxies as requested.
    Parameters: - artifacts_or_ids : collection of artifacts or artifact ids (strings)
    - group_artifacts_of_same_name : bool (default: False)
        If set to True then artifacts of the same name will be grouped together in one list. When set to False an exception will be raised if multiple artifacts share the same name.
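The lazy loading described above can be illustrated with a small stand-in that only invokes a loader on first access. This is not the library's implementation; it simply mirrors the deferred-loading idea:

```python
class ToyLazyDict:
    """Stand-in for lazy_proxy_dict: maps keys to artifact ids and only
    calls the (potentially expensive) loader when a value is first
    accessed, caching the result for later lookups."""

    def __init__(self, ids_by_key, loader):
        self._ids_by_key = ids_by_key
        self._loader = loader
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            # First access: load and cache; later accesses are free.
            self._cache[key] = self._loader(self._ids_by_key[key])
        return self._cache[key]

    def keys(self):
        return self._ids_by_key.keys()
```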
Visualization (beta)