autoflow.hdl package

Submodules

autoflow.hdl.hdl_constructor module

class autoflow.hdl.hdl_constructor.HDL_Constructor(DAG_workflow: Union[str, Dict[str, Any]] = 'generic_recommend', hdl_bank_path=None, hdl_bank=None, hdl_metadata=<frozendict {}>, included_classifiers=('adaboost', 'catboost', 'decision_tree', 'extra_trees', 'gaussian_nb', 'knn', 'linearsvc', 'svc', 'lightgbm', 'logistic_regression', 'random_forest', 'sgd'), included_regressors=('adaboost', 'bayesian_ridge', 'catboost', 'decision_tree', 'elasticnet', 'extra_trees', 'gaussian_process', 'knn', 'kernel_ridge', 'linearsvr', 'lightgbm', 'random_forest', 'sgd'), included_highR_nan_imputers=('operate.drop', 'operate.keep_going'), included_imputers=('impute.adaptive_fill',), included_highC_cat_encoders=('operate.drop', 'encode.ordinal', 'encode.cat_boost'), included_cat_encoders=('encode.one_hot', 'encode.ordinal', 'encode.cat_boost'), num2purified_workflow=<frozendict {'num->scaled': ['scale.standardize', 'operate.keep_going'], 'scaled->purified': ['operate.keep_going', 'transform.power']}>, text2purified_workflow=<frozendict {'text->tokenized': 'text.tokenize.simple', 'tokenized->purified': ['text.topic.tsvd', 'text.topic.lsi', 'text.topic.nmf']}>, date2purified_workflow=<frozendict {}>, purified2final_workflow=<frozendict {'purified->final': ['operate.keep_going']}>)[source]

Bases: autoflow.utils.klass.StrSignatureMixin

HDL is an abbreviation of Hyper-parameter Description Language. It describes an abstract hyper-parameter space that is independent of any concrete implementation.

HDL_Constructor is the class responsible for translating a dict-type DAG workflow into HDL.

If the DAG workflow is not explicitly assigned (the string “generic_recommend” is the default), a generic DAG workflow will be recommended by analyzing the input data when run() is called.

Then, by calling run(), the DAG workflow is translated into HDL.

Parameters
  • DAG_workflow (str or dict, default="generic_recommend") –

    directed acyclic graph (DAG) workflow describing the machine-learning procedure.

    By default, this value is “generic_recommend”, which means HDL_Constructor will analyze the training data to recommend a valid DAG workflow.

    If you want to design the DAG workflow yourself, you can pass a dict.

  • hdl_bank_path (str, default=None) –

    hdl_bank is a JSON file that contains all the hyper-parameters of the algorithms.

    hdl_bank_path is the path of this file. If it is None, autoflow/hdl/hdl_bank.json will be chosen.

  • hdl_bank (dict, default=None) – If you pass hdl_bank_path=None and pass hdl_bank as a dict, the program will not load hdl_bank.json; it uses the passed hdl_bank directly.

  • included_classifiers (list or tuple) –

    active if DAG_workflow="generic_recommend"; all of the following parameters are only active in that situation.

    It decides which classifiers will be considered in the algorithm selection.

  • included_regressors (list or tuple) – It decides which regressors will be considered in the algorithm selection.

  • included_highR_nan_imputers (list or tuple) –

    highR_nan is a feature_group, meaning a column that has a high ratio of NaN values.

    for example:

    >>> import numpy as np
    >>> column = [1, 2, np.nan, np.nan, np.nan]    # NaN ratio is 60%, more than 50% (the default highR_nan_threshold)
    

    The highR_nan_imputers algorithms will handle columns that contain such a high ratio of missing values.

  • included_cat_nan_imputers (list or tuple) –

    cat_nan is a feature_group, meaning a categorical feature column that contains NaN values.

    for example:

    >>> import numpy as np
    >>> column = ["a", "b", "c", "d", np.nan]
    

    The cat_nan_imputers algorithms will handle such columns.

  • included_num_nan_imputers (list or tuple) –

    num_nan is a feature_group, meaning a numerical feature column that contains NaN values.

    for example:

    >>> import numpy as np
    >>> column = [1, 2, 3, 4, np.nan]
    

    The num_nan_imputers algorithms will handle such columns.

  • included_highC_cat_encoders (list or tuple) –

    highC_cat is a feature_group, meaning a categorical feature column with a high cardinality ratio.

    for example:

    >>> import numpy as np
    >>> column = ["a", "b", "c", "d", "a"]
    >>> rows = len(column)
    >>> np.unique(column).size / rows  # result is 0.8, higher than 0.5 (the default highC_cat_threshold)
    0.8
    

    The highC_cat_encoders algorithms will handle such columns.

  • included_lowR_cat_encoders (list or tuple) –

    lowR_cat is a feature_group, meaning a categorical feature column with a low cardinality ratio.

    for example:

    >>> import numpy as np
    >>> column = ["a", "a", "a", "d", "a"]
    >>> rows = len(column)
    >>> np.unique(column).size / rows  # result is 0.4, lower than 0.5 (the default threshold)
    0.4
    

    The lowR_cat_encoders algorithms will handle such columns.
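Putting the parameters above together, a hand-designed DAG_workflow is just a dict whose keys are "source->destination" edges between feature_groups and whose values are the candidate algorithms for that step. The sketch below is illustrative only: it reuses step names that appear in this class's default workflows, and it builds the plain dict you would pass as DAG_workflow, without importing autoflow.

```python
# A hand-designed DAG_workflow dict (illustrative sketch; step names are
# taken from the default workflows shown in the signature above).
# Each key is an edge "source->destination" between feature groups; each
# value is the candidate algorithm, or a list of candidates to choose from
# during hyper-parameter optimization.
DAG_workflow = {
    "num->scaled": ["scale.standardize", "operate.keep_going"],
    "scaled->purified": ["operate.keep_going", "transform.power"],
    "purified->target": ["lightgbm", "random_forest"],
}

# Split every edge into its source and destination feature groups.
edges = [tuple(edge.split("->")) for edge in DAG_workflow]
print(edges)
```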

random_state
Type

int

ml_task
Type

autoflow.utils.ml_task.MLTask

data_manager
Type

autoflow.manager.data_manager.DataManager

hdl

constructed by run()

Type

dict

Examples

>>> import numpy as np
>>> from autoflow.manager.data_manager import DataManager
>>> from autoflow.hdl.hdl_constructor import HDL_Constructor
>>> hdl_constructor = HDL_Constructor(DAG_workflow={"num->target":["lightgbm"]},
...   hdl_bank={"classification":{"lightgbm":{"boosting_type":  {"_type": "choice", "_value":["gbdt","dart","goss"]}}}})
>>> data_manager = DataManager(X_train=np.random.rand(3,3), y_train=np.arange(3))
>>> hdl_constructor.run(data_manager, 42, 0.5)
>>> hdl_constructor.hdl
{'preprocessing': {}, 'estimating(choice)': {'lightgbm': {'boosting_type': {'_type': 'choice', '_value': ['gbdt', 'dart', 'goss']}}}}
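The hdl dict printed above is a nested tree: keys ending in "(choice)" mark a discrete choice between algorithms, and nodes carrying "_type"/"_value" describe a single hyper-parameter's search space. As a sketch (using only the output shown here, with no autoflow import), the bottom-level hyper-parameters can be collected by walking the tree:

```python
# The HDL produced by the example above (copied from hdl_constructor.hdl).
hdl = {
    'preprocessing': {},
    'estimating(choice)': {
        'lightgbm': {
            'boosting_type': {'_type': 'choice',
                              '_value': ['gbdt', 'dart', 'goss']}
        }
    },
}

def iter_hp_leaves(node, path=()):
    """Yield (path, leaf) for every bottom-level hyper-parameter description."""
    if isinstance(node, dict):
        if "_type" in node and "_value" in node:  # a bottom-level search-space leaf
            yield path, node
        else:
            for key, child in node.items():
                yield from iter_hp_leaves(child, path + (key,))

leaves = list(iter_hp_leaves(hdl))
for path, leaf in leaves:
    print("/".join(path), "->", leaf["_value"])
```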
draw_workflow_space(colorful=True, candidates_colors=('#663366', '#663300', '#666633', '#333366', '#660033'), feature_groups_colors=('#0099CC', '#0066CC', '#339933', '#FFCC33', '#33CC99', '#FF0033', '#663399', '#FF6600'))[source]

Notes

You must install graphviz on your computer.

If you are using Ubuntu or another Debian-based Linux, you should run:

$ sudo apt-get install graphviz

You can also install graphviz by conda:

$ conda install -c conda-forge graphviz

Returns

graph – You can find the usage of graphviz.dot.Digraph at https://graphviz.readthedocs.io/en/stable/manual.html

Return type

graphviz.dot.Digraph

generic_recommend() → Dict[str, List[Union[str, Dict[str, Any]]]][source]

Recommend a generic DAG workflow-space.

Returns

DAG_workflow

Return type

dict
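The returned dict has the same shape as a hand-written DAG_workflow. As an illustrative sketch (the edges below are hypothetical; real output depends on the training data analyzed during run()), one can sanity-check such a workflow by verifying that every intermediate feature group is consumed by a later step and that the chain ends at "target":

```python
# Hypothetical recommended workflow, for illustration only.  The algorithm
# names ("operate.drop", "impute.adaptive_fill", ...) appear among this
# class's defaults; the exact edges are an assumption.
DAG_workflow = {
    "highR_nan->nan": "operate.drop",
    "nan->purified": "impute.adaptive_fill",
    "purified->target": ["lightgbm", "random_forest"],
}

sources = {edge.split("->")[0] for edge in DAG_workflow}
destinations = {edge.split("->")[1] for edge in DAG_workflow}

# Every destination except the final "target" should feed a later step.
dangling = destinations - sources - {"target"}
assert not dangling, f"unconsumed feature groups: {dangling}"
assert "target" in destinations  # the workflow must produce a prediction target
print("workflow edges:", sorted(DAG_workflow))
```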

get_hdl() → Dict[str, Any][source]
Returns

hdl

Return type

dict

get_hdl_dataframe() → pandas.core.frame.DataFrame[source]
get_params_in_dict(hdl_bank: dict, packages: str, phase: str, mainTask)[source]
interactive_display_workflow_space()[source]
parse_item(value: Union[dict, str]) → Tuple[str, dict, bool][source]
purify_DAG_describe()[source]
purify_step_name(step: str)[source]
run(data_manager, model_registry=None)[source]
Parameters
  • data_manager (autoflow.manager.data_manager.DataManager) –

  • highC_cat_threshold (float) –

autoflow.hdl.utils module

autoflow.hdl.utils.add_leader_model(key, leader_model, SERIES_CONNECT_LEADER_TOKEN)[source]
autoflow.hdl.utils.get_default_hp_of_cls(cls)[source]
autoflow.hdl.utils.get_hdl_bank(path: str, logger=None) → Dict[source]
autoflow.hdl.utils.get_origin_models(raw_models: List[str])[source]
autoflow.hdl.utils.is_hdl_bottom(key, value)[source]
autoflow.hdl.utils.purify_key(key: str)[source]
autoflow.hdl.utils.purify_keys(dict_: dict) → Iterator[str][source]

Module contents