Introduction
============

Tabular Data-Processing Workflow
----------------------------------

`Titanic`_ is perhaps the most familiar machine learning task for data scientists.
A sampled sub-table is shown below:

.. csv-table:: Titanic Origin Data
    :file: csv/origin.csv

Notice that such a raw table cannot be used directly in data mining or machine learning.
We should fill the missing values, encode the categorical values, and so on.
In order to introduce the practical problems this project wants to solve, let me first introduce a concept: the ``feature group``.

Feature Group
~~~~~~~~~~~~~~~~~

Except for the columns that cannot provide entity-specific properties, like ``id``, the remaining columns are called ``features``.

.. note::
    You can find the definition of ``column_descriptions`` in :class:`autoflow.manager.data_manager.DataManager`

If some ``features`` have similar properties, they are contained in the same ``feature group``.

.. note::
    You can find some ``feature group`` examples and practices in :class:`autoflow.hdl.hdl_constructor.HDL_Constructor`

.. _nan:

nan
:::::::::

``nan`` is the abbreviation of ``Not a Number``, indicating that this column has missing values, like this:

>>> from numpy import NaN
>>> import pandas as pd
>>> import numpy as np
>>> column = [1, 2, 3, 4, NaN]

.. _num:

num
:::::::::

``num`` is the abbreviation of ``numerical``, indicating that all values in this column are numerical.

.. note::
    Only the ``num`` `feature group` can be used in the estimating phase.

For example:

>>> column = [1, 2, 3, 4, 5]

.. _cat:

cat
::::::::::

``cat`` is the abbreviation of ``categorical``, indicating that this column contains at least one string-type value.

For example:

>>> column = [1, 2, 3, 4, "a"]

.. _num_nan:

num_nan
::::::::::

``num_nan`` is the abbreviation of ``numerical NaN``, indicating that this column is full of numbers except for missing values.

For example:

>>> column = [1, 2, 3, 4, NaN]

.. _cat_nan:

cat_nan
::::::::::

``cat_nan`` is the abbreviation of ``categorical NaN``, indicating that this column has at least one string other than the missing values.

For example:

>>> column = [1, 2, 3, "a", NaN]

.. _highR_nan:

highR_nan
::::::::::

``highR_nan`` is the abbreviation of ``high ratio NaN``, indicating that most of this column's values are missing.

For example:

>>> column = np.array([1, 2, NaN, NaN, NaN])
>>> np.count_nonzero(pd.isna(column)) / column.size
0.6

The NaN ratio is 0.6, more than 0.5 (the default ``highR_nan_threshold``).

.. _lowR_nan:

lowR_nan
::::::::::

``lowR_nan`` is the abbreviation of ``low ratio NaN``, indicating that only a minority of this column's values are missing.

For example:

>>> column = np.array([1, 2, 3, NaN, NaN])
>>> np.count_nonzero(pd.isna(column)) / column.size
0.4

The NaN ratio is 0.4, less than 0.5 (the default ``highR_nan_threshold``).

.. _highC_cat:

highC_cat
::::::::::

``highC_cat`` is the abbreviation of ``high cardinality ratio categorical``, indicating that this column is a categorical column (see :ref:`cat`), and the number of its unique values divided by the total number of rows is more than ``highC_cat_threshold``.

For example:

>>> column = ["a", "b", "c", "d", "d"]
>>> rows = len(column)
>>> np.unique(column).size / rows
0.8

The cardinality ratio is 0.8, more than 0.5 (the default ``highC_cat_threshold``).

.. _lowR_cat:

lowR_cat
::::::::::

``lowR_cat`` is the abbreviation of ``low cardinality ratio categorical``, indicating that this column is a categorical column (see :ref:`cat`), and the number of its unique values divided by the total number of rows is less than ``lowR_cat_threshold``.

For example:

>>> column = ["a", "a", "d", "d", "d"]
>>> rows = len(column)
>>> np.unique(column).size / rows
0.4

The cardinality ratio is 0.4, less than 0.5 (the default ``lowR_cat_threshold``).
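To make these definitions concrete, here is a minimal sketch of how a single column could be assigned to one of the feature groups above, assuming a pandas ``Series`` as input. The function ``classify_column`` and its exact rules are illustrative assumptions, not the actual implementation of :class:`autoflow.manager.data_manager.DataManager`:

.. code-block:: python

    import numpy as np
    import pandas as pd

    def classify_column(column: pd.Series,
                        highR_nan_threshold: float = 0.5,
                        highC_cat_threshold: float = 0.5) -> str:
        """Hypothetical helper: map a column to a feature group name."""
        nan_ratio = column.isna().mean()
        if nan_ratio > highR_nan_threshold:
            return "highR_nan"
        non_nan = column.dropna()
        is_cat = non_nan.map(lambda v: isinstance(v, str)).any()
        if nan_ratio > 0:  # lowR_nan columns still need imputing first
            return "cat_nan" if is_cat else "num_nan"
        if not is_cat:
            return "num"
        cardinality_ratio = non_nan.nunique() / len(column)
        return "highC_cat" if cardinality_ratio > highC_cat_threshold else "lowR_cat"

    print(classify_column(pd.Series([1, 2, np.nan, np.nan, np.nan])))  # highR_nan
    print(classify_column(pd.Series(["a", "b", "c", "d", "d"])))       # highC_cat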
Work Flow
~~~~~~~~~~

After defining the concept of the ``feature group``, the ``workflow`` is the next important concept.

You can consider the whole machine-learning training and testing procedure as a directed acyclic graph (DAG), excluding ETL and other data-preparation and feature-extraction procedures.

In this graph, nodes are ``feature groups`` and edges are `data-processing or estimating algorithms`. Each edge's tail node is a ``feature group`` **before processing**, and each edge's head node is another ``feature group`` **after processing**.

You should keep in mind that each edge represents **one** algorithm or **a list of** candidate algorithms.

For example, after a series of data-processing steps, only a single :ref:`num` (numerical) `feature group` remains, and we should do the estimating (`fit the features to the target column`):

.. graphviz::

    digraph estimating {
        "num" -> "target" [ label="{lightgbm, random_forest}" ];
    }

In this figure we can see that ``lightgbm`` and ``random_forest`` are candidate algorithms.

Some computer scientists describe ``AutoML`` as a ``CASH`` problem (Combined Algorithm Selection and Hyper-parameter optimization problem). In fact, it is the algorithm selection on the edges that turns this ``workflow`` into a ``workflow space``.

Here is the ``workflow space`` figure for the `Titanic`_ task.

.. image:: images/workflow_space.png

Instance In Titanic
~~~~~~~~~~~~~~~~~~~~

You may be curious about the ``workflow space`` picture above and want to know how it works. Let me introduce the processing details step by step.

**First step**, the data manager (:class:`autoflow.manager.data_manager.DataManager`) splits the raw data into four ``feature groups``: :ref:`nan`, :ref:`highR_nan`, :ref:`cat` and :ref:`num`, like this:

.. csv-table:: First Step : Split By Data Manager
    :file: csv/origin_split.csv

This corresponds to this figure:

.. graphviz::

    digraph estimating {
        "data" -> "cat" [ label="data_manager: cat" ];
        "data" -> "num" [ label="data_manager: num" ];
        "data" -> "nan" [ label="data_manager: nan" ];
        "data" -> "highR_nan" [ label="data_manager: highR_nan" ];
    }

-----------------------------------------------------------------------------------

**Second step**, ``highR_nan_imputer`` should process the :ref:`highR_nan` `feature group` into :ref:`nan`. ``merge`` (which means doing nothing, just renaming the `feature group`) and ``drop`` are the candidate algorithms; in this case, we choose the ``drop`` option.

.. csv-table:: Second Step : Process highR nan
    :file: csv/process_highR_nan.csv

This corresponds to this figure:

.. graphviz::

    digraph estimating {
        "data" -> "cat" [ label="data_manager: cat" ];
        "data" -> "num" [ label="data_manager: num" ];
        "data" -> "nan" [ label="data_manager: nan" ];
        "data" -> "highR_nan" [ label="data_manager: highR_nan" ];
        "highR_nan" -> "nan" [ label="{operate.drop, operate.keep_going}" ];
    }
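As a rough illustration of what the ``drop`` option in this step does, the following pandas sketch removes every column whose NaN ratio exceeds the threshold. The function name ``drop_highR_nan`` is hypothetical, not AutoFlow's API:

.. code-block:: python

    import numpy as np
    import pandas as pd

    def drop_highR_nan(df: pd.DataFrame, highR_nan_threshold: float = 0.5) -> pd.DataFrame:
        """Drop columns whose NaN ratio exceeds the threshold (illustrative only)."""
        nan_ratio = df.isna().mean()  # per-column NaN ratio
        keep = nan_ratio[nan_ratio <= highR_nan_threshold].index
        return df[keep]

    df = pd.DataFrame({
        "Age": [22.0, np.nan, 26.0, 35.0],         # lowR_nan: NaN ratio 0.25
        "Cabin": [np.nan, "C85", np.nan, np.nan],  # highR_nan: NaN ratio 0.75
    })
    print(drop_highR_nan(df).columns.tolist())     # ['Age']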
-----------------------------------------------------------------------------------

**Third step**, the ``operate.split.cat_num`` algorithm splits :ref:`nan` into two feature groups: :ref:`cat_nan` and :ref:`num_nan`.

.. csv-table:: Third Step : Split Categorical and Numerical
    :file: csv/split_cat_num.csv

This corresponds to this figure:

.. graphviz::

    digraph estimating {
        "data" -> "cat" [ label="data_manager: cat" ];
        "data" -> "num" [ label="data_manager: num" ];
        "data" -> "nan" [ label="data_manager: nan" ];
        "data" -> "highR_nan" [ label="data_manager: highR_nan" ];
        "highR_nan" -> "nan" [ label="{operate.drop, operate.keep_going}" ];
        "nan" -> "cat_nan" [ label="operate.split.cat_num: cat_nan" ];
        "nan" -> "num_nan" [ label="operate.split.cat_num: num_nan" ];
    }

-----------------------------------------------------------------------------------

**Fourth step**, fill :ref:`cat_nan` to get :ref:`cat`, and fill :ref:`num_nan` to get :ref:`num`.

.. csv-table:: Fourth Step : Fill NaN
    :file: csv/fill.csv

This corresponds to this figure:

.. graphviz::

    digraph estimating {
        "data" -> "cat" [ label="data_manager: cat" ];
        "data" -> "num" [ label="data_manager: num" ];
        "data" -> "nan" [ label="data_manager: nan" ];
        "data" -> "highR_nan" [ label="data_manager: highR_nan" ];
        "highR_nan" -> "nan" [ label="{operate.drop, operate.keep_going}" ];
        "nan" -> "cat_nan" [ label="operate.split.cat_num: cat_nan" ];
        "nan" -> "num_nan" [ label="operate.split.cat_num: num_nan" ];
        "cat_nan" -> "cat" [ label="impute.fill_cat" ];
        "num_nan" -> "num" [ label="impute.fill_num" ];
    }

-----------------------------------------------------------------------------------

**Fifth step**, the ``operate.split.cat`` algorithm splits :ref:`cat` into two feature groups: :ref:`highC_cat` and :ref:`lowR_cat`.

.. csv-table:: Fifth Step : Split Categorical to :ref:`highC_cat` and :ref:`lowR_cat`
    :file: csv/split_cat.csv

This corresponds to this figure:

.. graphviz::

    digraph estimating {
        "data" -> "cat" [ label="data_manager: cat" ];
        "data" -> "num" [ label="data_manager: num" ];
        "data" -> "nan" [ label="data_manager: nan" ];
        "data" -> "highR_nan" [ label="data_manager: highR_nan" ];
        "highR_nan" -> "nan" [ label="{operate.drop, operate.keep_going}" ];
        "nan" -> "cat_nan" [ label="operate.split.cat_num: cat_nan" ];
        "nan" -> "num_nan" [ label="operate.split.cat_num: num_nan" ];
        "cat_nan" -> "cat" [ label="impute.fill_cat" ];
        "num_nan" -> "num" [ label="impute.fill_num" ];
        "cat" -> "highC_cat" [ label="operate.split.cat: highC_cat" ];
        "cat" -> "lowR_cat" [ label="operate.split.cat: lowR_cat" ];
    }

-----------------------------------------------------------------------------------

**Sixth step**, we encode :ref:`highC_cat` to :ref:`num` with ``label_encoder``, and encode :ref:`lowR_cat` to :ref:`num` with ``one_hot_encoder``.

.. csv-table:: Sixth Step : Encoding Categorical to Numerical
    :file: csv/encode.csv

This corresponds to this figure:

.. graphviz::

    digraph estimating {
        "data" -> "cat" [ label="data_manager: cat" ];
        "data" -> "num" [ label="data_manager: num" ];
        "data" -> "nan" [ label="data_manager: nan" ];
        "data" -> "highR_nan" [ label="data_manager: highR_nan" ];
        "highR_nan" -> "nan" [ label="{operate.drop, operate.keep_going}" ];
        "nan" -> "cat_nan" [ label="operate.split.cat_num: cat_nan" ];
        "nan" -> "num_nan" [ label="operate.split.cat_num: num_nan" ];
        "cat_nan" -> "cat" [ label="impute.fill_cat" ];
        "num_nan" -> "num" [ label="impute.fill_num" ];
        "cat" -> "highC_cat" [ label="operate.split.cat: highC_cat" ];
        "highC_cat" -> "num" [ label="encode.label" ];
        "cat" -> "lowR_cat" [ label="operate.split.cat: lowR_cat" ];
        "lowR_cat" -> "num" [ label="encode.one_hot" ];
    }
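As a rough sketch of this encoding step: label encoding maps each category of a high-cardinality column to an integer, while one-hot encoding expands a low-cardinality column into indicator columns. The pandas code below is an illustration only, not AutoFlow's ``encode.label`` / ``encode.one_hot`` implementation:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({
        "Ticket": ["A/5 21171", "PC 17599", "113803", "373450"],  # highC_cat
        "Sex": ["male", "female", "female", "male"],              # lowR_cat
    })

    # Label-encode the high-cardinality column: one integer per category.
    df["Ticket"] = df["Ticket"].astype("category").cat.codes

    # One-hot encode the low-cardinality column: one indicator column per category.
    df = pd.get_dummies(df, columns=["Sex"], prefix="Sex")

    print(df)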
-----------------------------------------------------------------------------------

**Seventh step**, finally, we have finished the whole data-preprocessing phase, and now we should do the estimating. ``lightgbm`` and ``random_forest`` are the candidate algorithms.

This corresponds to this figure:

.. graphviz::

    digraph estimating {
        "data" -> "cat" [ label="data_manager: cat" ];
        "data" -> "num" [ label="data_manager: num" ];
        "data" -> "nan" [ label="data_manager: nan" ];
        "data" -> "highR_nan" [ label="data_manager: highR_nan" ];
        "highR_nan" -> "nan" [ label="{operate.drop, operate.keep_going}" ];
        "nan" -> "cat_nan" [ label="operate.split.cat_num: cat_nan" ];
        "nan" -> "num_nan" [ label="operate.split.cat_num: num_nan" ];
        "cat_nan" -> "cat" [ label="impute.fill_cat" ];
        "num_nan" -> "num" [ label="impute.fill_num" ];
        "cat" -> "highC_cat" [ label="operate.split.cat: highC_cat" ];
        "highC_cat" -> "num" [ label="encode.label" ];
        "cat" -> "lowR_cat" [ label="operate.split.cat: lowR_cat" ];
        "lowR_cat" -> "num" [ label="encode.one_hot" ];
        "num" -> "target" [ label="{lightgbm, random_forest}" ];
    }
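To see why the choice on this last edge matters, here is a minimal sketch of algorithm selection between the two candidate estimators via cross-validation. It uses scikit-learn and LightGBM's scikit-learn wrapper directly on synthetic data; this is an assumption for illustration, not how AutoFlow's optimizer works internally:

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from lightgbm import LGBMClassifier

    # A stand-in for the fully numerical feature matrix produced by the workflow.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    candidates = {
        "random_forest": RandomForestClassifier(random_state=0),
        "lightgbm": LGBMClassifier(random_state=0),
    }

    # Pick the candidate with the best cross-validated accuracy.
    scores = {name: cross_val_score(est, X, y, cv=5).mean()
              for name, est in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores, "->", best)

In a real ``CASH`` setting, each candidate's hyper-parameters would be searched jointly with the algorithm choice, rather than using the defaults as above.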