Sklearn: Strangely Unstable Prediction Results

Problem Description

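While building a random forest model with Sklearn, I found that even a previously saved model, once loaded, produced different predictions from one run to the next. I had already set a random seed everywhere I could think of, so I could not make sense of it. The original code is as follows: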

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from matplotlib import pyplot as plt
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression, chi2, mutual_info_regression
import numpy as np
import joblib

pheno = pd.read_csv('cpgen_pheno_file.txt', sep='\t', index_col=0)
gene = pd.read_csv('log2(FPKM+1).txt', sep='\t').T
gene['type'] = gene.index.to_series().apply(lambda x: 0 if x.startswith('S') else 1)
concat = pd.concat([pheno, gene], axis=1)
print(concat.shape)

X = concat.iloc[:, 4:]
Y_1 = concat.iloc[:, 0]
Y_2 = concat.iloc[:, 1]
Y_3 = concat.iloc[:, 2]
Y_4 = concat.iloc[:, 3]

chosen_pheno = Y_1

vs = VarianceThreshold(threshold=0.4)
X = vs.fit_transform(X)
print(X.shape)
kb = SelectKBest(mutual_info_regression, k=int(X.shape[1] * 0.8))
X = kb.fit_transform(X, chosen_pheno)
print(X.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X, chosen_pheno, test_size=0.2, random_state=42)
Y_train = np.log1p(Y_train)
Y_test = np.log1p(Y_test)
print(Y_test.sum())

# random forest
# rf = RandomForestRegressor(n_estimators=80, random_state=42)
# rf.fit(X_train, Y_train)
rf = joblib.load('models/Variance Only/rf_NN_9479_0.5299.pkl')
predictions = rf.predict(X_test)
print(predictions.sum())
score = rf.score(X_test, Y_test)

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(Y_test, predictions)
plt.plot([min(Y_test), max(Y_test)], [min(Y_test), max(Y_test)], color='red')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.title(f'{score}')
plt.show()

# # Save the model
# joblib.dump(rf, f'models/rf_80_{chosen_pheno.name}_{X.shape[1]}_{score:.4f}.pkl')

Solving the Problem

1. Locate the unstable point

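Run the script several times and check whether two sums change between runs: print(Y_test.sum()) and print(predictions.sum()). This shows that the instability is on the prediction side, so the next step is to work out why the predictions change.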
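The same check can be automated instead of comparing printed sums by eye. The following is only a minimal sketch: `is_stable`, `deterministic` and `noisy` are hypothetical names that do not appear in the original script, and the two lambdas are toy stand-ins for a pipeline stage. It also only detects instability within one process, whereas the original observation was made across separate runs of the script.

```python
import numpy as np

def is_stable(stage, n_runs=3):
    """Call stage() n_runs times and report whether every result equals the first."""
    results = [np.asarray(stage()) for _ in range(n_runs)]
    return all(np.array_equal(results[0], r) for r in results[1:])

# Toy stand-ins for pipeline stages (hypothetical, not from the original script).
data = np.arange(10.0)
rng = np.random.default_rng()  # deliberately unseeded

deterministic = lambda: data.sum()
noisy = lambda: (data + rng.normal(size=data.shape)).sum()

print(is_stable(deterministic))  # the sum of fixed data never changes
print(is_stable(noisy))          # fresh noise on every call changes the sum
```

Wrapping each stage of the real script (the split, the feature selection, the prediction) in such a callable narrows down which one is responsible.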

2. Set a global random seed

Set NumPy's global random seed with np.random.seed(42). Once this is set, the predictions stay the same from run to run.
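Why the global seed helps can be sketched as follows. When a scikit-learn function is left at random_state=None, it draws from NumPy's global generator, which is what sklearn.utils.check_random_state(None) returns. The snippet is only an illustration of that behaviour, not part of the original script; the variable names are made up for the demo.

```python
import numpy as np
from sklearn.utils import check_random_state

# With random_state=None, scikit-learn falls back to NumPy's global generator,
# so reseeding that generator replays the same sequence of draws.
np.random.seed(42)
first = check_random_state(None).standard_normal(5)

np.random.seed(42)
second = check_random_state(None).standard_normal(5)

# No reseed here: the global generator simply continues its sequence.
third = check_random_state(None).standard_normal(5)

print(np.array_equal(first, second))  # same seed, same draws
print(np.array_equal(first, third))   # generator state has advanced
```

This also shows the limitation of a global seed: the result depends on how many random draws happened before a given call, so reordering the script can change the outcome again.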

3. Comment out code step by step

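At this point the problem can only lie in the variance threshold filter or in SelectKBest. After commenting out SelectKBest, the results stay stable. According to the official documentation, SelectKBest itself has no random seed parameter, which leads to the real culprit: its scoring function, mutual_info_regression. The official documentation describes that function's random_state parameter as follows: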

random_state: int, RandomState instance or None, default=None

Determines random number generation for adding small noise to continuous variables in order to remove repeated values. Pass an int for reproducible results across multiple function calls. See Glossary.
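Since SelectKBest only accepts a scoring callable, one way to hand random_state to mutual_info_regression is to bind it with functools.partial. This is a sketch on synthetic data, not the original dataset: X_demo, y_demo and fit_scores are hypothetical names, and the integer-valued features are chosen only so that repeated values occur, as the documentation excerpt describes.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-in data with many repeated values (not the original dataset).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(60, 10)).astype(float)
y_demo = X_demo[:, 0] + rng.normal(size=60)

# Bind the seed to the scoring function; SelectKBest calls it as score_func(X, y).
score_func = partial(mutual_info_regression, random_state=42)

def fit_scores():
    """Fit a fresh selector and return its scores and the kept column indices."""
    kb = SelectKBest(score_func, k=int(X_demo.shape[1] * 0.8))
    kb.fit(X_demo, y_demo)
    return kb.scores_, kb.get_support(indices=True)

scores_a, cols_a = fit_scores()
scores_b, cols_b = fit_scores()

print(np.array_equal(scores_a, scores_b))  # identical scores on every fit
print(np.array_equal(cols_a, cols_b))      # identical selected columns
```

Compared with np.random.seed, this pins the randomness of this one step regardless of what else in the script consumes random numbers.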

4. Problem solved

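Setting a global seed is a good habit. In this case the loaded model itself was never the source of the randomness: mutual_info_regression adds random noise when scoring features, so the features kept by SelectKBest, and with them the input passed to the saved model, could change from run to run. Passing random_state to mutual_info_regression directly fixes the same problem at its source.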