HDL (Hyper-param Description Language)

class autoflow.hdl.hdl_constructor.HDL_Constructor(
    DAG_workflow: Union[str, Dict[str, Any]] = 'generic_recommend',
    hdl_bank_path=None,
    hdl_bank=None,
    hdl_metadata=<frozendict {}>,
    included_classifiers=('adaboost', 'catboost', 'decision_tree', 'extra_trees', 'gaussian_nb', 'knn', 'linearsvc', 'svc', 'lightgbm', 'logistic_regression', 'random_forest', 'sgd'),
    included_regressors=('adaboost', 'bayesian_ridge', 'catboost', 'decision_tree', 'elasticnet', 'extra_trees', 'gaussian_process', 'knn', 'kernel_ridge', 'linearsvr', 'lightgbm', 'random_forest', 'sgd'),
    included_highR_nan_imputers=('operate.drop', 'operate.keep_going'),
    included_imputers=('impute.adaptive_fill',),
    included_highC_cat_encoders=('operate.drop', 'encode.ordinal', 'encode.cat_boost'),
    included_cat_encoders=('encode.one_hot', 'encode.ordinal', 'encode.cat_boost'),
    num2purified_workflow=<frozendict {'num->scaled': ['scale.standardize', 'operate.keep_going'], 'scaled->purified': ['operate.keep_going', 'transform.power']}>,
    text2purified_workflow=<frozendict {'text->tokenized': 'text.tokenize.simple', 'tokenized->purified': ['text.topic.tsvd', 'text.topic.lsi', 'text.topic.nmf']}>,
    date2purified_workflow=<frozendict {}>,
    purified2final_workflow=<frozendict {'purified->final': ['operate.keep_going']}>
)
HDL is the abbreviation of Hyper-parameter Description Language. It describes an abstract hyper-parameter space that is independent of any concrete implementation.

HDL_Constructor is the class responsible for translating a dict-type DAG-workflow into HDL.

If DAG-workflow is not explicitly assigned (the string "generic_recommend" is the default), a generic DAG-workflow will be recommended by analyzing the input data during run(). Calling run() then translates the DAG-workflow into HDL.

- Parameters
  - DAG_workflow (str or dict, default="generic_recommend") – Directed acyclic graph (DAG) workflow that describes the machine-learning procedure. By default this value is "generic_recommend", which means HDL_Constructor will analyze the training data to recommend a valid DAG workflow. If you want to design the DAG workflow yourself, you can pass a dict (see the sketch after this parameter list).
  - hdl_bank_path (str, default=None) – hdl_bank is a JSON file which contains all the hyper-parameters of the algorithms. hdl_bank_path is this file's path. If it is None, autoflow/hdl/hdl_bank.json will be chosen.

  - hdl_bank (dict, default=None) – If you pass hdl_bank_path=None and pass hdl_bank as a dict, the program will not load hdl_bank.json; it uses the passed hdl_bank directly.

  - included_classifiers (list or tuple) – Active if DAG_workflow="generic_recommend", as are all of the following params. It decides which classifiers will be considered in the algorithm selection.

  - included_regressors (list or tuple) – It decides which regressors will be considered in the algorithm selection.
  - included_highR_nan_imputers (list or tuple) – highR_nan is a feature group in which NaN has a high ratio in a column. For example:

    >>> from numpy import NaN
    >>> column = [1, 2, NaN, NaN, NaN]  # NaN ratio is 60%, more than 50% (default highR_nan_threshold)

    highR_nan_imputers algorithms will handle such columns containing a high ratio of missing values.

  - included_cat_nan_imputers (list or tuple) – cat_nan is a feature group in which a categorical feature column contains NaN values. For example:

    >>> column = ["a", "b", "c", "d", NaN]

    cat_nan_imputers algorithms will handle such columns.

  - included_num_nan_imputers (list or tuple) – num_nan is a feature group in which a numerical feature column contains NaN values. For example:

    >>> column = [1, 2, 3, 4, NaN]

    num_nan_imputers algorithms will handle such columns.

  - included_highC_cat_encoders (list or tuple) – highC_cat is a feature group in which a categorical feature column has a high cardinality ratio. For example:

    >>> import numpy as np
    >>> column = ["a", "b", "c", "d", "a"]
    >>> rows = len(column)
    >>> np.unique(column).size / rows  # 0.8 is higher than the default threshold of 0.5
    0.8

    highC_cat_encoders algorithms will handle such columns.

  - included_lowR_cat_encoders (list or tuple) – lowR_cat is a feature group in which a categorical feature column has a low cardinality ratio. For example:

    >>> import numpy as np
    >>> column = ["a", "a", "a", "d", "a"]
    >>> rows = len(column)
    >>> np.unique(column).size / rows  # 0.4 is lower than the default threshold of 0.5
    0.4

    lowR_cat_encoders algorithms will handle such columns.
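For reference, here is a minimal sketch of a hand-designed DAG_workflow dict. The algorithm keys are taken from the defaults shown in the signature above; apart from "num->target" (which appears in the Examples section below), the edge names are hypothetical and only illustrate the "feature_group->feature_group" convention:

>>> DAG_workflow = {
...     "highR_nan->nan": "operate.drop",              # drop columns with a high NaN ratio
...     "nan->imputed": "impute.adaptive_fill",        # impute the remaining missing values
...     "imputed->num": "encode.ordinal",              # encode categorical columns as integers
...     "num->target": ["lightgbm", "random_forest"],  # candidate estimators for algorithm selection
... }
>>> hdl_constructor = HDL_Constructor(DAG_workflow=DAG_workflow)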
Attributes

- ml_task

- data_manager – Type: autoflow.manager.data_manager.DataManager
Examples

>>> import numpy as np
>>> from autoflow.manager.data_manager import DataManager
>>> from autoflow.hdl.hdl_constructor import HDL_Constructor
>>> hdl_constructor = HDL_Constructor(DAG_workflow={"num->target": ["lightgbm"]},
...     hdl_bank={"classification": {"lightgbm": {"boosting_type": {"_type": "choice", "_value": ["gbdt", "dart", "goss"]}}}})
>>> data_manager = DataManager(X_train=np.random.rand(3, 3), y_train=np.arange(3))
>>> hdl_constructor.run(data_manager, 42, 0.5)
>>> hdl_constructor.hdl
{'preprocessing': {}, 'estimating(choice)': {'lightgbm': {'boosting_type': {'_type': 'choice', '_value': ['gbdt', 'dart', 'goss']}}}}
draw_workflow_space(colorful=True, candidates_colors=('#663366', '#663300', '#666633', '#333366', '#660033'), feature_groups_colors=('#0099CC', '#0066CC', '#339933', '#FFCC33', '#33CC99', '#FF0033', '#663399', '#FF6600'))

Notes

You must install graphviz on your computer.

If you are using Ubuntu or another Debian-based Linux, you should run:

$ sudo apt-get install graphviz

You can also install graphviz with conda:

$ conda install -c conda-forge graphviz

- Returns

  graph – You can find the usage of graphviz.dot.Digraph at https://graphviz.readthedocs.io/en/stable/manual.html

- Return type

  graphviz.dot.Digraph
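For instance, a minimal usage sketch (assuming hdl_constructor.run() has already been called as in the Examples section above, and that graphviz is installed):

>>> graph = hdl_constructor.draw_workflow_space()
>>> graph.render("workflow_space", format="png")  # renders the DOT source to workflow_space.png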
Data Manager

Resource Manager

Tuner
class autoflow.tuner.Tuner(
    evaluator: Union[Callable, str] = 'TrainEvaluator',
    search_method: str = 'smac',
    run_limit: int = 100,
    initial_runs: int = 20,
    search_method_params: dict = <frozendict {}>,
    n_jobs: int = 1,
    exit_processes: Optional[int] = None,
    limit_resource: bool = True,
    per_run_time_limit: float = 60,
    per_run_memory_limit: float = 3072,
    time_left_for_this_task: float = None,
    n_jobs_in_algorithm=1,
    debug=False
)

Tuner is a class that drives an abstract search process.

- Parameters
  - evaluator (callable or str) – evaluator is a function, a callable class (one implementing the magic method __call__), or a string indicator. evaluator receives a shp (SMAC Hyper Param, a ConfigSpace.ConfigurationSpace) and returns a dict containing the following keys:

    - loss – you can think of it as a negative reward.
    - status – a string; SUCCESS means the trial finished normally, FAILED means it crashed.

    By default, "TrainEvaluator" is the string indicator of autoflow.evaluation.train_evaluator.TrainEvaluator. A sketch of a custom evaluator is shown after this parameter list.

  - search_method (str) – The specific search method; random, smac and grid are available:

    - random – random search algorithm,
    - grid – grid search algorithm,
    - smac – Bayesian search with the SMAC algorithm.
  - run_limit (int) – Limit on the number of search steps.

  - initial_runs (int) – If you choose the smac algorithm, be aware that SMAC has an initialization procedure: the algorithm needs enough initial runs to gather enough experience. This param is ignored if random or grid is selected.

  - search_method_params (dict) – Configuration for the specific search method.

  - n_jobs (int) – n_jobs search processes will be started.

  - exit_processes (int) –
  - limit_resource (bool) – If limit_resource = True, a search trial will be killed if it uses more CPU time or memory than allowed.

  - per_run_time_limit (float) – Active if limit_resource = True. A search trial will be killed if it uses more CPU time than per_run_time_limit.

  - per_run_memory_limit (float) – Active if limit_resource = True. A search trial will be killed if it uses more memory than per_run_memory_limit.

  - time_left_for_this_task (float) – Active if limit_resource = True. A search task will be killed if its total running time exceeds time_left_for_this_task.

  - debug (bool) – Debug mode. Exceptions will be re-raised if debug = True.
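To make the evaluator contract concrete, here is a minimal sketch of a custom evaluator and a Tuner built around it. The function my_evaluator and its constant loss are hypothetical and purely illustrative; only the returned dict keys follow the contract described above:

>>> from autoflow.tuner import Tuner
>>> def my_evaluator(shp):
...     # shp is a sampled configuration (SMAC Hyper Param); a real evaluator
...     # would train a model here and return its validation loss.
...     return {"loss": 1.0, "status": "SUCCESS"}  # constant loss, for illustration only
>>> tuner = Tuner(evaluator=my_evaluator, search_method="random",
...               run_limit=10, limit_resource=False)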