07. 检查点与热启动

检查点

为了防止优化过程异常中断(比如计算机异常断电,用户KeyBoardInterrupt等),我们开发了检查点与热启动机制。检查点机制保证每迭代checkpoint_freq次后将优化器以FMinResult的形式存储在硬盘上(文件路径为checkpoint_file)。

[1]:
from ultraopt import fmin, FMinResult
from ultraopt.tests.mock import config_space, evaluate
import sys
sys.tracebacklimit = 0 # limit traceback infomation

我们运行一个样例,并且在运行完之前使用KeyboardInterrupt将其中断:

[2]:
result = fmin(evaluate, config_space,
              checkpoint_file="checkout.pkl", # 检查点保存的路径
              checkpoint_freq=1,  # 保存检查点的频率,默认为 10, 为了更及时地保存优化器状态,这里设置为 1
              n_iterations=100000,  # 设置一个很大的值,运行到一半 我们中断程序
            )
  0%|          | 177/100000 [00:13<2:02:44, 13.55trial/s, best loss: 0.437]
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
AssertionError
---------------------------------------------------------------------------

因为如果在保存checkpoint_file的时候中断的话会导致检查点文件不完整,UltraOpt的机制是会形成检查点备份文件:

[3]:
!ls -lh *.pkl *.bak
-rw-r--r-- 1 tqc tqc 699K 12月 29 12:15 checkout.pkl
-rw-r--r-- 1 tqc tqc 699K 12月 29 12:15 checkout.pkl.bak
[4]:
from joblib import load

如果加载保存不完整的检查点文件,会报错

[5]:
result = load("checkout.pkl")
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
AttributeError: 'EOFError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
AssertionError
---------------------------------------------------------------------------

但我们可以加载检查点备份文件:

[8]:
checkout_content = load("checkout.pkl.bak")

检查点保存的内容本质上是优化器, UltraOpt的设计哲学是以优化器为中心,优化器承载了优化过程中的全部状态,所以我们只需要在检查点中保存优化器 :

[9]:
type(checkout_content)
[9]:
ultraopt.optimizer.bo.etpe_opt.ETPEOptimizer

我们可以用ultraopt.FMinResult这个数据结构包装优化器,这个数据结构也是ultraopt.fmin的返回值:

[20]:
result = FMinResult(checkout_content)

对于加载得到的FMinResult,我们可以像之前的教程一样对优化结果优化过程进行数据分析:

[21]:
result
[21]:
+---------------------------------+
| HyperParameters | Optimal Value |
+-----------------+---------------+
| x0              | 0.3409        |
| x1              | 0.1209        |
+-----------------+---------------+
| Optimal Loss    | 0.4366        |
+-----------------+---------------+
| Num Configs     | 176           |
+-----------------+---------------+

随迭代数的拟合曲线

[22]:
result.plot_convergence(yscale="log");
../_images/_tutorials_07._Checkpoint_and_Warmstart_20_0.png

随时间的拟合曲线

[23]:
result.plot_convergence_over_time(yscale="log");
../_images/_tutorials_07._Checkpoint_and_Warmstart_22_0.png

随时间的运行数

[24]:
result.plot_finished_over_time();
../_images/_tutorials_07._Checkpoint_and_Warmstart_24_0.png

热启动

在优化过程异常中断后,如果我们想重启优化过程,需要指定之前的运行结果:previous_result参数

[25]:
result = fmin(evaluate, config_space,
              checkpoint_file="checkout.pkl",
              checkpoint_freq=1,
              previous_result="checkout.pkl.bak", # 之前的运行结果
              n_iterations=20,  # 只运行20次
            )
100%|██████████| 20/20 [00:03<00:00,  6.20trial/s, best loss: 0.437]

我们看到虽然只运行了20次,但是之前的优化过程都是有记录的:

[30]:
result
[30]:
+---------------------------------+
| HyperParameters | Optimal Value |
+-----------------+---------------+
| x0              | 0.3409        |
| x1              | 0.1209        |
+-----------------+---------------+
| Optimal Loss    | 0.4366        |
+-----------------+---------------+
| Num Configs     | 215           |
+-----------------+---------------+

随时间的运行数曲线也可以正常绘制,各种图表反映了中断前和恢复后的状态:

[28]:
result.plot_finished_over_time();
../_images/_tutorials_07._Checkpoint_and_Warmstart_31_0.png
[29]:
result.plot_convergence_over_time(yscale="log");
../_images/_tutorials_07._Checkpoint_and_Warmstart_32_0.png