Study material: sklearn 中文文档 (the sklearn Chinese documentation)

scikit-learn Machine Learning Experiments

Lab requirements:

Experiment 1: Install Python and the scikit-learn experiment environment; download the example data, perform data-preprocessing operations, and plot visualizations. Specifically: remove noisy data, convert feature values, unify scales, normalize, and visualize.
Experiment 2: Clustering, plus logistic regression or SVM.

Experiment 1

  1. Download and install the experiment environment from the Anaconda website (https://www.anaconda.com/download/) and add it to the system PATH; the distribution already ships with the required libraries such as numpy, scipy, pandas, and sklearn.

  2. Open the instructor-provided eye-tracking spreadsheet (眼动跟踪数据.xlsx) in Excel and save it as a compatibility-mode .xls file so it can be imported with pandas.

    # Import the dataset (pd.read_excel takes no encoding argument in current pandas, and reading .xls requires the xlrd package)
    import pandas as pd
    eye_tracking_data = pd.read_excel('eye_tracking.xls')
    eye_tracking_data.columns = ['实验编号','ImageName','鞋身类型', '鞋带类型','性别','专业','注视点个数','平均注视点持续时间','鞋带平均瞳孔直径','鞋带注视时间','鞋带注视点个数','鞋身平均瞳孔直径','鞋身注视时间','鞋身注视点个数','主观评价']
    print(eye_tracking_data.shape)   # the data has 479 rows and 15 columns: (479, 15)
    print(eye_tracking_data.tail(50))

    image.png

  3. Next comes preprocessing. The first six columns are categorical features; pandas' map method is used to map these ordered features to integers.

    # Feature mapping; this should eventually be replaced with one-hot encoding
    Experiment_number_mapping = {"s1":1,"s2":2,"s3":3,"s4":4,"s5":5,"s6":6,"s7":7,"s8":8,"s9":9,"s10":10,"s11":11,"s12":12,"s13":13,"s14":14,"s15":15,"s16":16,"s17":17,"s18":18,"s19":19,"s20":20}
    Photo_number_mapping = {"1jpg":1,"2jpg":2,"3jpg":3,"4jpg":4,"5jpg":5,"6jpg":6,"7jpg":7,"8jpg":8,"9jpg":9,"10jpg":10,"11jpg":11,"12jpg":12,"13jpg":13,"14jpg":14,"15jpg":15,"16jpg":16,"17jpg":17,"18jpg":18,"19jpg":19,"20jpg":20,"21jpg":21,"22jpg":22,"23jpg":23,"24jpg":24}
    Type_shoe_mapping = {"1合成革":1,"2反绒皮":2,"3网格":3,"4棉布":4,"5真皮革":5,"6帆布":6}
    Type_shoelace_mapping = {"1弹力带":1,"2魔术贴":2,"3圆形鞋带":3,"4扁形鞋带":4}
    Sex_mapping = {"1男":1,"2女":2}
    Discipline_mapping = {"1服装专业":1,"2非服装":2}

    eye_tracking_data["实验编号"] = eye_tracking_data["实验编号"].map(Experiment_number_mapping)
    eye_tracking_data["ImageName"] = eye_tracking_data["ImageName"].map(Photo_number_mapping)
    eye_tracking_data["鞋身类型"] = eye_tracking_data["鞋身类型"].map(Type_shoe_mapping)
    eye_tracking_data["鞋带类型"] = eye_tracking_data["鞋带类型"].map(Type_shoelace_mapping)
    eye_tracking_data["性别"] = eye_tracking_data["性别"].map(Sex_mapping)
    eye_tracking_data["专业"] = eye_tracking_data["专业"].map(Discipline_mapping)
    print(eye_tracking_data.head())

    Output:
    image.png

  4. As the figure in step 2 shows, the 鞋身平均瞳孔直径 column contains values such as ".". These can be converted to NaN and then either filled in or dropped outright. The fillna function can fill missing entries with a mean, a median, a neighboring value, and so on, via its value and method parameters (see the sketch after the figures below)…

    # Convert the "." noise values in the table to NaN
    eye_tracking_data["鞋身平均瞳孔直径"] = eye_tracking_data["鞋身平均瞳孔直径"].replace(r'\.+', np.nan, regex=True)
    # Drop the incomplete rows outright
    eye_tracking_data = eye_tracking_data.dropna()
    #print(eye_tracking_data.head(50))
    # ...or fill in the missing values instead
    #eye_tracking_data["鞋身平均瞳孔直径"] = eye_tracking_data["鞋身平均瞳孔直径"].fillna(value=0)

    image.png
    (row 473's data has been dropped)
    image.png
    (row 473's original "." value has been filled with 0)
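
    As a hedged sketch of the other filling options mentioned above (assuming the dropna() line is skipped so NaN values remain; the column name comes from this dataset and the choice of statistic is purely illustrative):

    # Coerce the column to float so that summary statistics can be computed
    col = eye_tracking_data["鞋身平均瞳孔直径"].astype(float)
    # Fill missing entries with the column mean
    eye_tracking_data["鞋身平均瞳孔直径"] = col.fillna(col.mean())
    # ...or with the median
    #eye_tracking_data["鞋身平均瞳孔直径"] = col.fillna(col.median())
    # ...or carry the previous valid observation forward
    #eye_tracking_data["鞋身平均瞳孔直径"] = col.ffill()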

  5. Then define the sample features and the sample output, and split the data into training and test sets.

    # The first 14 columns are the sample features X
    X = eye_tracking_data[['实验编号','ImageName','鞋身类型', '鞋带类型','性别','专业','注视点个数','平均注视点持续时间','鞋带平均瞳孔直径','鞋带注视时间','鞋带注视点个数','鞋身平均瞳孔直径','鞋身注视时间','鞋身注视点个数']]
    # 主观评价 (the subjective rating) is the sample output y
    y = eye_tracking_data[['主观评价']]
    # Split into training and test sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
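
    By default train_test_split holds out 25% of the samples for testing. If the rating classes are imbalanced, a stratified split keeps their proportions equal in both sets; this is a hedged variation, not part of the original, and assumes every rating value occurs at least twice:

    # Same split, but explicit about the test fraction and stratified on the label
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y.values.ravel(), random_state=0)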
  6. Unify scales and normalize

    # Standardization: zero mean and unit variance per feature
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(X_train)
    standardized_X = scaler.transform(X_train)
    standardized_X_test = scaler.transform(X_test)
    print("standardized_X:", standardized_X, "\n", "standardized_X_test:", standardized_X_test)

    # Normalization: scale each sample (row) to unit norm
    from sklearn.preprocessing import Normalizer
    scaler = Normalizer().fit(X_train)
    normalized_X = scaler.transform(X_train)
    normalized_X_test = scaler.transform(X_test)
    print("normalized_X:", normalized_X, "\n", "normalized_X_test:", normalized_X_test)
    # Binarization
    from sklearn.preprocessing import Binarizer
    binarizer = Binarizer(threshold=0.0).fit(X)
    binary_X = binarizer.transform(X)

    (code for further preprocessing operations is given in the appendix)
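
    For completeness, min-max scaling to [0, 1] is another common way to unify scales; this is a hedged sketch using scikit-learn's standard MinMaxScaler, not part of the original snippet:

    from sklearn.preprocessing import MinMaxScaler

    minmax = MinMaxScaler().fit(X_train)        # learn per-feature min/max on the training set
    minmax_X = minmax.transform(X_train)        # scale training features into [0, 1]
    minmax_X_test = minmax.transform(X_test)    # apply the same scaling to the test set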

  7. Logistic regression for Experiment 2

    # Model
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics
    # Logistic regression model
    log_model = LogisticRegression()
    # Train it (ravel flattens the single-column DataFrame y_train to the 1-D array sklearn expects)
    log_model.fit(X_train, y_train.values.ravel())
    # Predict y
    y_pred = log_model.predict(X_test)
    # Inspect the test results
    print(metrics.confusion_matrix(y_test, y_pred))
    print(metrics.classification_report(y_test, y_pred))

    Experiment 2

    Because the relationships among the eye-tracking features were not yet well understood, and scikit-learn was still new to me, a second dataset is used below to practice visualization and to learn the K-Means clustering algorithm.

  8. Randomly generate some two-dimensional data as the training set; two features are chosen so the results are easy to visualize.

    import numpy as np
    import matplotlib.pyplot as plt
    # make_blobs lives in sklearn.datasets in current scikit-learn (sklearn.datasets.samples_generator was removed)
    from sklearn.datasets import make_blobs
    # X holds the sample features and y the cluster labels: 1000 samples with 2 features each,
    # in 4 clusters centered at [-1,-1], [0,0], [1,1], [2,2] with standard deviations [0.4, 0.2, 0.2, 0.2]
    X, y = make_blobs(n_samples=1000, n_features=2, centers=[[-1,-1], [0,0], [1,1], [2,2]], cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=1)
    plt.scatter(X[:, 0], X[:, 1], marker='o')
    plt.show()

    image.png

  9. Cluster the data with K-Means (the mini-batch variant, MiniBatchKMeans) and use matplotlib to plot the output for k = 2, 3, 4, 5. The four near-identical blocks can also be collapsed into a loop; see the sketch after the result figure.

    from sklearn import metrics
    from sklearn.cluster import MiniBatchKMeans
    # Each clustering is scored with the Calinski-Harabasz index
    # (called calinski_harabaz_score in older scikit-learn; renamed in 0.23)
    plt.subplot(2, 2, 1)
    y_pred = MiniBatchKMeans(n_clusters=2, batch_size=200, random_state=1).fit_predict(X)
    score = metrics.calinski_harabasz_score(X, y_pred)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.text(.99, .01, 'k=%d, score: %.2f' % (2, score),
             transform=plt.gca().transAxes, size=10,
             horizontalalignment='right')
    plt.subplot(2, 2, 2)
    y_pred = MiniBatchKMeans(n_clusters=3, batch_size=200, random_state=1).fit_predict(X)
    score = metrics.calinski_harabasz_score(X, y_pred)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.text(.99, .01, 'k=%d, score: %.2f' % (3, score),
             transform=plt.gca().transAxes, size=10,
             horizontalalignment='right')
    plt.subplot(2, 2, 3)
    y_pred = MiniBatchKMeans(n_clusters=4, batch_size=200, random_state=1).fit_predict(X)
    score = metrics.calinski_harabasz_score(X, y_pred)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.text(.99, .01, 'k=%d, score: %.2f' % (4, score),
             transform=plt.gca().transAxes, size=10,
             horizontalalignment='right')
    plt.subplot(2, 2, 4)
    y_pred = MiniBatchKMeans(n_clusters=5, batch_size=200, random_state=1).fit_predict(X)
    score = metrics.calinski_harabasz_score(X, y_pred)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.text(.99, .01, 'k=%d, score: %.2f' % (5, score),
             transform=plt.gca().transAxes, size=10,
             horizontalalignment='right')
    plt.show()

    image.png
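
    The four nearly identical subplot blocks above can be collapsed into a single loop over k; this variant appears (commented out) at the end of Appendix 2 and reuses the metrics and MiniBatchKMeans imports from the block above:

    # One subplot per k, each clustering scored with the Calinski-Harabasz index
    for index, k in enumerate((2, 3, 4, 5)):
        plt.subplot(2, 2, index + 1)
        y_pred = MiniBatchKMeans(n_clusters=k, batch_size=200, random_state=1).fit_predict(X)
        score = metrics.calinski_harabasz_score(X, y_pred)
        plt.scatter(X[:, 0], X[:, 1], c=y_pred)
        plt.text(.99, .01, 'k=%d, score: %.2f' % (k, score),
                 transform=plt.gca().transAxes, size=10,
                 horizontalalignment='right')
    plt.show()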

  10. Logistic regression

    # Model (X_train, y_train etc. come from a train_test_split of the blob data; see Appendix 2)
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics
    # Logistic regression model
    log_model = LogisticRegression()
    # Train it
    log_model.fit(X_train, y_train)
    # Predict y
    y_pred = log_model.predict(X_test)
    # Inspect the test results
    print(metrics.confusion_matrix(y_test, y_pred))
    print(metrics.classification_report(y_test, y_pred))

    image.png
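
    Beyond the single train/test split above, k-fold cross-validation gives a less variance-prone accuracy estimate. A minimal sketch using scikit-learn's standard cross_val_score utility (not part of the original; X and y here are the blob arrays):

    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validated accuracy of the logistic regression model on the full blob data
    scores = cross_val_score(log_model, X, y, cv=5)
    print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))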

Appendix (code)

Appendix 1

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def main():
    # Pre-processing
    # Import the dataset (pd.read_excel takes no encoding argument in current pandas; reading .xls requires the xlrd package)
    eye_tracking_data = pd.read_excel('eye_tracking.xls')
    eye_tracking_data.columns = ['实验编号','ImageName','鞋身类型', '鞋带类型','性别','专业','注视点个数','平均注视点持续时间','鞋带平均瞳孔直径','鞋带注视时间','鞋带注视点个数','鞋身平均瞳孔直径','鞋身注视时间','鞋身注视点个数','主观评价']
    #print(eye_tracking_data.tail(10))
    #print(eye_tracking_data.shape)  # the data has 479 rows and 15 columns: (479, 15)

    # Feature mapping; this should eventually be replaced with one-hot encoding
    Experiment_number_mapping = {"s1":1,"s2":2,"s3":3,"s4":4,"s5":5,"s6":6,"s7":7,"s8":8,"s9":9,"s10":10,"s11":11,"s12":12,"s13":13,"s14":14,"s15":15,"s16":16,"s17":17,"s18":18,"s19":19,"s20":20}
    Photo_number_mapping = {"1jpg":1,"2jpg":2,"3jpg":3,"4jpg":4,"5jpg":5,"6jpg":6,"7jpg":7,"8jpg":8,"9jpg":9,"10jpg":10,"11jpg":11,"12jpg":12,"13jpg":13,"14jpg":14,"15jpg":15,"16jpg":16,"17jpg":17,"18jpg":18,"19jpg":19,"20jpg":20,"21jpg":21,"22jpg":22,"23jpg":23,"24jpg":24}
    Type_shoe_mapping = {"1合成革":1,"2反绒皮":2,"3网格":3,"4棉布":4,"5真皮革":5,"6帆布":6}
    Type_shoelace_mapping = {"1弹力带":1,"2魔术贴":2,"3圆形鞋带":3,"4扁形鞋带":4}
    Sex_mapping = {"1男":1,"2女":2}
    Discipline_mapping = {"1服装专业":1,"2非服装":2}

    eye_tracking_data["实验编号"] = eye_tracking_data["实验编号"].map(Experiment_number_mapping)
    eye_tracking_data["ImageName"] = eye_tracking_data["ImageName"].map(Photo_number_mapping)
    eye_tracking_data["鞋身类型"] = eye_tracking_data["鞋身类型"].map(Type_shoe_mapping)
    eye_tracking_data["鞋带类型"] = eye_tracking_data["鞋带类型"].map(Type_shoelace_mapping)
    eye_tracking_data["性别"] = eye_tracking_data["性别"].map(Sex_mapping)
    eye_tracking_data["专业"] = eye_tracking_data["专业"].map(Discipline_mapping)

    #print(eye_tracking_data.head())
    # Convert the "." noise values in the table to NaN
    eye_tracking_data["鞋身平均瞳孔直径"] = eye_tracking_data["鞋身平均瞳孔直径"].replace(r'\.+', np.nan, regex=True)
    # Drop the incomplete rows outright
    eye_tracking_data = eye_tracking_data.dropna()
    #print(eye_tracking_data.tail(10))
    # ...or fill in the missing values instead
    #eye_tracking_data["鞋身平均瞳孔直径"] = eye_tracking_data["鞋身平均瞳孔直径"].fillna(value=0)
    #print(eye_tracking_data.tail(10))

    # The first 14 columns are the sample features X
    X = eye_tracking_data[['实验编号','ImageName','鞋身类型', '鞋带类型','性别','专业','注视点个数','平均注视点持续时间','鞋带平均瞳孔直径','鞋带注视时间','鞋带注视点个数','鞋身平均瞳孔直径','鞋身注视时间','鞋身注视点个数']]
    #print(X.head())
    # 主观评价 (the subjective rating) is the sample output y
    y = eye_tracking_data[['主观评价']]
    #print(y.head())

    # Split into training and test sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Standardization: zero mean and unit variance per feature
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(X_train)
    standardized_X = scaler.transform(X_train)
    standardized_X_test = scaler.transform(X_test)
    #print("standardized_X:", standardized_X, "\n", "standardized_X_test:", standardized_X_test)

    # Normalization: scale each sample (row) to unit norm
    from sklearn.preprocessing import Normalizer
    scaler = Normalizer().fit(X_train)
    normalized_X = scaler.transform(X_train)
    normalized_X_test = scaler.transform(X_test)
    #print("normalized_X:", normalized_X, "\n", "normalized_X_test:", normalized_X_test)

    # Binarization
    from sklearn.preprocessing import Binarizer
    binarizer = Binarizer(threshold=0.0).fit(X)
    binary_X = binarizer.transform(X)

    '''
    # Encode categorical labels
    from sklearn.preprocessing import LabelEncoder
    enc = LabelEncoder()
    y = enc.fit_transform(y)
    '''

    # Impute missing values (here coded as 0) with the column mean
    from sklearn.impute import SimpleImputer
    imp = SimpleImputer(missing_values=0, strategy='mean')
    imp.fit_transform(X_train)

    # Binarize the quantitative features with a threshold of 3
    from sklearn.preprocessing import Binarizer
    Binarizer(threshold=3).fit_transform(eye_tracking_data)

    # One-hot (dummy) encode categorical features; here all values are flattened into one column first
    from sklearn.preprocessing import OneHotEncoder
    OneHotEncoder().fit_transform(eye_tracking_data.values.reshape((-1,1)))
    #print(OneHotEncoder().fit_transform(eye_tracking_data["性别"].values.reshape((-1,1))))

    # Model
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics
    # Logistic regression model
    log_model = LogisticRegression()
    # Train it (ravel flattens the single-column DataFrame y_train to a 1-D array)
    log_model.fit(X_train, y_train.values.ravel())
    # Predict y
    y_pred = log_model.predict(X_test)
    # Inspect the test results
    print(metrics.confusion_matrix(y_test, y_pred))
    print(metrics.classification_report(y_test, y_pred))

    '''
    from sklearn.linear_model import LinearRegression
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    print(linreg.intercept_)
    print(linreg.coef_)
    # Evaluate the model on the test set
    y_pred = linreg.predict(X_test)
    from sklearn import metrics
    # MSE via scikit-learn
    print("MSE:", metrics.mean_squared_error(y_test, y_pred))
    # RMSE via scikit-learn
    print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    '''

if __name__ == '__main__':
    main()

Appendix 2

import numpy as np
import matplotlib.pyplot as plt

# make_blobs lives in sklearn.datasets in current scikit-learn (sklearn.datasets.samples_generator was removed)
from sklearn.datasets import make_blobs
# X holds the sample features and y the cluster labels: 1000 samples with 2 features each,
# in 4 clusters centered at [-1,-1], [0,0], [1,1], [2,2] with standard deviations [0.4, 0.2, 0.2, 0.2]
X, y = make_blobs(n_samples=1000, n_features=2, centers=[[-1,-1], [0,0], [1,1], [2,2]], cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()
'''
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=2, random_state=1).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

from sklearn import metrics
metrics.calinski_harabasz_score(X, y_pred)

y_pred = KMeans(n_clusters=3, random_state=1).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

metrics.calinski_harabasz_score(X, y_pred)

y_pred = KMeans(n_clusters=4, random_state=1).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

metrics.calinski_harabasz_score(X, y_pred)
'''
from sklearn import metrics
from sklearn.cluster import MiniBatchKMeans
# One subplot per k, each clustering scored with the Calinski-Harabasz index
plt.subplot(2, 2, 1)
y_pred = MiniBatchKMeans(n_clusters=2, batch_size=200, random_state=1).fit_predict(X)
score = metrics.calinski_harabasz_score(X, y_pred)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.text(.99, .01, 'k=%d, score: %.2f' % (2, score),
         transform=plt.gca().transAxes, size=10,
         horizontalalignment='right')
plt.subplot(2, 2, 2)
y_pred = MiniBatchKMeans(n_clusters=3, batch_size=200, random_state=1).fit_predict(X)
score = metrics.calinski_harabasz_score(X, y_pred)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.text(.99, .01, 'k=%d, score: %.2f' % (3, score),
         transform=plt.gca().transAxes, size=10,
         horizontalalignment='right')
plt.subplot(2, 2, 3)
y_pred = MiniBatchKMeans(n_clusters=4, batch_size=200, random_state=1).fit_predict(X)
score = metrics.calinski_harabasz_score(X, y_pred)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.text(.99, .01, 'k=%d, score: %.2f' % (4, score),
         transform=plt.gca().transAxes, size=10,
         horizontalalignment='right')
plt.subplot(2, 2, 4)
y_pred = MiniBatchKMeans(n_clusters=5, batch_size=200, random_state=1).fit_predict(X)
score = metrics.calinski_harabasz_score(X, y_pred)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.text(.99, .01, 'k=%d, score: %.2f' % (5, score),
         transform=plt.gca().transAxes, size=10,
         horizontalalignment='right')
plt.show()

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Logistic regression model
log_model = LogisticRegression()
# Train it
log_model.fit(X_train, y_train)
# Predict y
y_pred = log_model.predict(X_test)
# Inspect the test results
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

'''
# Equivalent loop over k = 2..5 for the four subplot blocks above
for index, k in enumerate((2, 3, 4, 5)):
    plt.subplot(2, 2, index + 1)
    y_pred = MiniBatchKMeans(n_clusters=k, batch_size=200, random_state=1).fit_predict(X)
    score = metrics.calinski_harabasz_score(X, y_pred)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.text(.99, .01, 'k=%d, score: %.2f' % (k, score),
             transform=plt.gca().transAxes, size=10,
             horizontalalignment='right')

#plt.show()
'''

Course Design Project

Course design requirements

image.png

Objective

Project description

This project uses a dataset of Boston Celtics games from the 2010-2018 NBA seasons, collected from the web, and trains a logistic regression model on it to predict game outcomes. Anaconda, sklearn, and related libraries are combined with a range of classification algorithms to consolidate what this course covered, and PyQt5 is used to wrap the program in a graphical interface.
The dataset is shown below:
image.png
image.png

Functionality

Data cleaning

image.png

Feature value conversion

image.png

Data preprocessing

image.png

Algorithms

image.png
image.png

PyQt application interface

The program logic and code are given in full, with comments, in the appendix at the end.
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png

Analysis of results

Effect of data processing

image.png
image.png

Effect of the training-set split

image.png

Results of the classification algorithms

image.png

Program listing and source code

Complete implementation

# -*- coding: utf-8 -*-
from PyQt5.Qt import *
import numpy as np
import pandas as pd
from pylab import rcParams
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # encoders
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import StratifiedShuffleSplit

from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.svm import SVC, LinearSVC  # support vector machines
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.neighbors import KNeighborsClassifier  # KNN
from sklearn.naive_bayes import GaussianNB  # naive Bayes
from sklearn.tree import DecisionTreeClassifier  # decision tree
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

def chang_for_number(x):
    # Encode home games ('主') as 1 and away games ('客') as 0
    if x == '主':
        return 1
    elif x == '客':
        return 0
    else:
        return x

def guo_for_number(x):
    # Encode wins ('胜') as 1 and losses ('负') as 0
    if x == '胜' or x == '胜 ':
        return 1
    elif x == '负':
        return 0
    else:
        return x

def data_clean(x):
    return str(x).replace('\n', '')

# Classification algorithm
##############
# Sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

# Batch gradient-ascent optimization
def gradAscent(dataMatIn, classLabels, iterNum):
    dataMatrix = np.mat(dataMatIn)
    labelMat = np.mat(classLabels).transpose()
    m, n = np.shape(dataMatrix)
    alpha = 0.001
    weights = np.ones((n, 1))
    for k in range(iterNum):
        h = sigmoid(dataMatrix * weights)
        error = (labelMat - h)
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights

def stocGradDescent1(dataMatrix, classLabels, numIter=150):
    """
    :param dataMatrix: 2-D NumPy array
    :param classLabels: list of labels
    :param numIter: number of iterations
    :return: learned weights
    """
    m, n = np.shape(dataMatrix)
    dataMat = np.mat(dataMatrix)
    weights = np.ones((n, 1))
    for j in range(numIter):
        dataIndex = list(range(m))  # sample indices [0 1 2 .. m-1]
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.0001  # alpha shrinks as the iterations progress
            # draw a random sample index from the half-open interval [0, len(dataIndex))
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            h = sigmoid(np.sum(dataMat[dataIndex[randIndex]] * weights))
            error = classLabels[dataIndex[randIndex]] - h
            weights = weights + alpha * error * dataMat[dataIndex[randIndex]].transpose()
            # drop the used index
            del dataIndex[randIndex]
    return np.array(weights)

def predict(theta, X):
    y_predict = []
    probability = sigmoid(np.dot(X, theta))
    # classify each sample as 1 if its predicted probability is at least 0.5
    for x in probability:
        if float(x) >= 0.5:
            y_predict.append(1)
        else:
            y_predict.append(0)
    return y_predict

class Window(QWidget):
    def __init__(self):
        super().__init__()
        self.files = []
        self.initUI()

    def select_file(self):
        self.files, filetype = QFileDialog.getOpenFileNames(self, "多文件选择", "./", "Text Files (*.csv)")
        for file in self.files:
            self.l_test_load.setText(file)

        pix = QPixmap("")
        self.l_picture.setPixmap(pix)
        self.table.clearContents()

    def clear_file(self):
        self.l_test_load.setText("")
        pix = QPixmap("")
        self.l_picture.setPixmap(pix)
        self.table.clearContents()

    def showMsg(self):
        QMessageBox.warning(self, "警告", "请先选择数据文件路径", QMessageBox.Yes)

    def showMsgEnd(self):
        QMessageBox.warning(self, "提示", "已经是最后一张图片", QMessageBox.Yes)

    def showMsgStart(self):
        QMessageBox.warning(self, "提示", "已经是第一张图片", QMessageBox.Yes)

    def showNextPicture(self):
        if self.index < len(self.p_load) - 1:
            self.index = self.index + 1
            pix = QPixmap(self.p_load[self.index])
            self.l_picture.setPixmap(pix)
        else:
            self.showMsgEnd()

    def showBackPicture(self):
        if self.index > 0:
            self.index = self.index - 1
            pix = QPixmap(self.p_load[self.index])
            self.l_picture.setPixmap(pix)
        else:
            self.showMsgStart()

    def process(self):
        import numpy as np
        import pandas as pd
        if self.l_test_load.text() == "":
            self.showMsg()
        else:
            print("执行程序开始......")
            NBAFILE = self.l_test_load.text()
            df = pd.read_csv(NBAFILE, encoding='gbk')
            df['场'] = df['场'].map(lambda x: chang_for_number(x))
            df['果'] = df['果'].map(lambda x: guo_for_number(x))

            import matplotlib.pyplot as plt
            plt.rcParams['font.sans-serif'] = 'SimHei'  # enable Chinese characters in plots
            plt.rcParams['axes.unicode_minus'] = False
            churnvalue = df["果"].value_counts()
            labels = df["果"].value_counts().index
            rcParams["figure.figsize"] = 6, 6
            plt.pie(churnvalue, labels=labels, colors=["whitesmoke", "yellow"], explode=(0.1, 0),
                    autopct='%1.1f%%', shadow=True)
            plt.title("胜利失败饼图")
            plt.savefig('1.jpg')
            plt.show()

            f, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
            plt.subplot(1, 1, 1)
            gender = sns.countplot(x="场", hue="果", data=df, palette="Pastel2")  # palette sets the color theme, here Pastel2
            plt.xlabel("场")
            plt.title("主客场与胜利")
            plt.savefig('2.jpg')
            plt.show()

            f, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
            plt.subplot(1, 1, 1)
            seniorcitizen = sns.countplot(x="投篮命中", hue="果", data=df, palette="Pastel2")
            plt.xlabel("投篮命中")
            plt.title("投篮命中与结果")
            plt.savefig('3.jpg')
            plt.show()

            f, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
            plt.subplot(1, 1, 1)
            partner = sns.countplot(x="失误", hue="果", data=df, palette="Pastel2")
            plt.xlabel("失误")
            plt.title("失误与结果")
            plt.savefig('4.jpg')
            plt.show()

            f, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
            plt.subplot(1, 1, 1)
            dependents = sns.countplot(x="助攻", hue="果", data=df, palette="Pastel2")
            plt.xlabel("助攻")
            plt.title("助攻与结果")
            plt.savefig('5.jpg')
            plt.show()

            ########
            # Extract the features
            charges = df.iloc[:, 0:]
            # Encode the features
            """
            Categorical features are encoded in one of two ways:
            1. If the values have no ordering (e.g. color: [red, blue]), use one-hot encoding.
            2. If the values are ordered (e.g. size: [X, XL, XXL]), use a numeric mapping such as {X:1, XL:2, XXL:3}.
            """
            corrDf = charges.apply(lambda x: pd.factorize(x)[0])
            print(corrDf.head())
            # Build the correlation matrix
            corr = corrDf.corr()
            print(corr)

            # Show the correlation coefficients as a heat map
            '''
            heatmap: draws the coefficient matrix as a heat map
            linewidths: gap between the cells
            annot: whether to print the coefficient inside each cell
            '''
            plt.figure(figsize=(20, 16))
            ax = sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
                             linewidths=0.2, cmap="YlGnBu", annot=True)
            plt.title("变量之间的相关性")
            plt.savefig('6.jpg')
            plt.show()

            # One-hot encoding
            tel_dummies = pd.get_dummies(df.iloc[:, 0:])
            tel_dummies.head()

            ####################
            # Data preprocessing
            telcomvar = df.iloc[:, 0:]
            print(telcomvar)
            # telcomvar.drop("PhoneService", axis=1, inplace=True)
            telcomvar.head()
            # Scale the data so that no feature dominates the prediction simply because of a larger value range
            scaler = MinMaxScaler(copy=False)
            # fit_transform() first fits the data, then converts it to the scaled form
            scaler.fit_transform(telcomvar[['场', '投篮命中', '投篮出手', '三分命中', '三分出手', '罚球命中', '罚球出手', '前场篮板',
                                            '后场篮板', '助攻', '抢断', '盖帽', '失误', '犯规', '得分']])
            # transform() applies the fitted centering and scaling
            telcomvar[['场', '投篮命中', '投篮出手', '三分命中', '三分出手', '罚球命中', '罚球出手', '前场篮板',
                       '后场篮板', '助攻', '抢断', '盖帽', '失误', '犯规', '得分']] = scaler.transform(
                telcomvar[['场', '投篮命中', '投篮出手', '三分命中', '三分出手', '罚球命中', '罚球出手', '前场篮板',
                           '后场篮板', '助攻', '抢断', '盖帽', '失误', '犯规', '得分']])
            # Use a box plot to check for outliers after scaling
            plt.figure(figsize=(8, 4))
            print(telcomvar)
            numbox = sns.boxplot(data=telcomvar[['场', '投篮命中', '投篮出手', '三分命中', '三分出手', '罚球命中', '罚球出手', '前场篮板',
                                                 '后场篮板', '助攻', '抢断', '盖帽', '失误', '犯规', '得分']], palette="Set2")
            plt.title("检查标准化的异常值")
            plt.savefig('7.jpg')
            plt.show()

            ################################# Algorithms
            X = df[['场', '投篮命中', '投篮出手', '三分命中', '三分出手', '罚球命中', '罚球出手', '前场篮板',
                    '后场篮板', '助攻', '抢断', '盖帽', '失误', '犯规', '得分']]
            y = df['果'].values
            print(X)
            sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
            print(sss)
            print("训练数据和测试数据被分成的组数:", sss.get_n_splits(X, y))

            # Build the training and test sets (the variables keep the last of the five splits)
            for train_index, test_index in sss.split(X, y):
                print("train:", train_index, "test:", test_index)
                X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                y_train, y_test = y[train_index], y[test_index]
            # Print the dataset sizes
            print('原始数据特征:', X.shape,
                  '训练数据特征:', X_train.shape,
                  '测试数据特征:', X_test.shape)

            print('原始数据标签:', y.shape,
                  ' 训练数据标签:', y_train.shape,
                  ' 测试数据标签:', y_test.shape)

            # ### Choose the machine-learning algorithms; 10 classifiers are compared here
            Classifiers = [["Random Forest", RandomForestClassifier()],
                           ["Support Vector Machine", SVC()],
                           ["LogisticRegression", LogisticRegression()],
                           ["KNN", KNeighborsClassifier(n_neighbors=5)],
                           ["Naive Bayes", GaussianNB()],
                           ["Decision Tree", DecisionTreeClassifier()],
                           ["AdaBoostClassifier", AdaBoostClassifier()],
                           ["GradientBoostingClassifier", GradientBoostingClassifier()],
                           ["XGB", XGBClassifier()],
                           ["CatBoost", CatBoostClassifier(logging_level='Silent')]
                           ]
            # ### Train the models
            Classify_result = []
            names = []
            prediction = []
            for name, classifier in Classifiers:
                if name == 'LogisticRegression':
                    # Logistic regression, trained with the hand-written stochastic gradient routine above
                    weights2 = stocGradDescent1(X_train, y_train, 200)
                    y_predict = predict(weights2, X_test)
                    recall = recall_score(y_test, y_predict)
                    precision = precision_score(y_test, y_predict)
                    f1 = f1_score(y_test, y_predict)
                else:
                    classifier.fit(X_train, y_train)
                    y_pred = classifier.predict(X_test)
                    recall = recall_score(y_test, y_pred)
                    precision = precision_score(y_test, y_pred)
                    f1 = f1_score(y_test, y_pred)
                print("分类算法:", name, ' 准确率:', round(precision, 4), ' recall:', round(recall, 4),
                      ' f1_score:', round(f1, 4))

            ############################

            # Image paths; the first image is shown by default
            p_load = ['1.jpg', '2.jpg', '3.jpg', '4.jpg', '5.jpg', '6.jpg', '7.jpg']
            self.p_load = p_load
            self.index = 0
            pix = QPixmap(self.p_load[self.index])
            self.l_picture.setPixmap(pix)

            # Table contents: index, true label, and predicted label for the first 10 test samples
            for i in range(0, 10, 1):
                self.table.setItem(i, 0, QTableWidgetItem(str(i)))
                self.table.setItem(i, 1, QTableWidgetItem(str(y_test[i])))
                self.table.setItem(i, 2, QTableWidgetItem(str(y_predict[i])))

    def initUI(self):
        # Window setup
        self.setWindowTitle("凯尔特人胜负预测数据分析")
        self.resize(1000, 600)
        self.move(400, 200)

        # Labels showing the test-data file name and path
        self.l_test_name = QLabel(self)
        self.l_test_name.setText("测试数据路径:")
        self.l_test_name.resize(100, 20)
        self.l_test_name.move(10, 20)
        self.l_test_load = QLabel(self)
        self.l_test_load.setText("")
        self.l_test_load.resize(550, 20)
        self.l_test_load.setStyleSheet("border: 1px solid black")
        self.l_test_load.move(110, 20)

        # File-selection button
        self.b_file = QPushButton(self)
        self.b_file.setText("选择文件")
        self.b_file.move(700, 20)
        self.b_file.clicked.connect(self.select_file)

        # Clear-files button
        self.b_clear = QPushButton(self)
        self.b_clear.setText("清空文件")
        self.b_clear.move(800, 20)
        self.b_clear.clicked.connect(self.clear_file)

        # Start button
        self.b_start = QPushButton(self)
        self.b_start.setText("开始")
        self.b_start.move(900, 20)
        self.b_start.clicked.connect(self.process)

        # Results table
        horizontalHeader = ['index', 'y', 'y_predict']
        self.table = QTableWidget(self)
        self.table.setColumnCount(3)
        self.table.setRowCount(10)
        self.table.horizontalHeader().setDefaultSectionSize(75)
        self.table.verticalHeader().setDefaultSectionSize(45)
        # self.table.horizontalHeader().resizeSection(0, 300)
        self.table.verticalHeader().resizeSection(0, 40)
        self.table.setHorizontalHeaderLabels(horizontalHeader)
        self.table.resize(410, 500)
        self.table.move(20, 60)

        # Image display area
        self.l_picture = QLabel(self)
        self.l_picture.setGeometry(450, 60, 500, 500)
        self.l_picture.setStyleSheet("border: 2px solid black")
        self.l_picture.setScaledContents(True)

        # Next-image button
        self.b_next = QPushButton(self)
        self.b_next.setText("下一张")
        self.b_next.move(850, 565)
        self.b_next.clicked.connect(self.showNextPicture)

        # Previous-image button (given its own attribute so it does not overwrite b_next)
        self.b_back = QPushButton(self)
        self.b_back.setText("上一张")
        self.b_back.move(750, 565)
        self.b_back.clicked.connect(self.showBackPicture)

if __name__ == '__main__':
    import sys
    app = QApplication(sys.argv)
    window = Window()
    window.show()
    sys.exit(app.exec())