Study material: sklearn Chinese documentation
scikit-learn Machine Learning Experiments
Lab requirements:
Experiment 1: install Python and the scikit-learn machine-learning environment; download the sample data and perform preprocessing; plot the visualization results. This covers removing noisy data, converting feature values, unifying scales, normalization, and visualization.
Experiment 2: clustering, and logistic regression or SVM.
Experiment 1
Download and install the environment from the Anaconda website, https://www.anaconda.com/download/, and add it to the system environment variables (the required libraries such as numpy, scipy, pandas, and sklearn come preinstalled).
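To confirm the environment works, a quick version check can be run (a sanity-check sketch, not part of the original write-up):

# Quick sanity check of the installed scientific stack
import numpy, scipy, pandas, sklearn
print("numpy:", numpy.__version__)
print("scipy:", scipy.__version__)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)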
Take the eye-tracking data provided by the teacher, 眼动跟踪数据.xlsx, and use Excel to save it as a compatibility-mode .xls file so it is easy to import with pandas.
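As an aside, the conversion step is no longer strictly necessary: recent pandas versions can read .xlsx files directly through the openpyxl engine (an alternative sketch, assuming openpyxl is installed):

# Read the original .xlsx directly instead of converting to .xls first
import pandas as pd
eye_tracking_data = pd.read_excel('眼动跟踪数据.xlsx', engine='openpyxl')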
# Import the dataset
import pandas as pd
# note: read_excel's encoding keyword was removed in newer pandas versions; simply drop it there
eye_tracking_data = pd.read_excel('eye_tracking.xls', encoding='utf-8')
eye_tracking_data.columns = ['实验编号','ImageName','鞋身类型','鞋带类型','性别','专业','注视点个数','平均注视点持续时间','鞋带平均瞳孔直径','鞋带注视时间','鞋带注视点个数','鞋身平均瞳孔直径','鞋身注视时间','鞋身注视点个数','主观评价']
print(eye_tracking_data.shape)  # the data has 479 rows and 15 columns: (479, 15)
print(eye_tracking_data.tail(50))

Next comes data preprocessing. The first six columns are categorical features; pandas' map method is used to encode these ordered features as integers.
# Feature mapping -- should later be replaced with one-hot encoding (see the sketch after this code block)
Experiment_number_mapping = {"s%d" % i: i for i in range(1, 21)}   # "s1".."s20" -> 1..20
Photo_number_mapping = {"%djpg" % i: i for i in range(1, 25)}      # "1jpg".."24jpg" -> 1..24
Type_shoe_mapping = {"1合成革":1,"2反绒皮":2,"3网格":3,"4棉布":4,"5真皮革":5,"6帆布":6}
Type_shoelace_mapping = {"1弹力带":1,"2魔术贴":2,"3圆形鞋带":3,"4扁形鞋带":4}
Sex_mapping = {"1男":1,"2女":2}
Discipline_mapping = {"1服装专业":1,"2非服装":2}
eye_tracking_data["实验编号"] = eye_tracking_data["实验编号"].map(Experiment_number_mapping)
eye_tracking_data["ImageName"] = eye_tracking_data["ImageName"].map(Photo_number_mapping)
eye_tracking_data["鞋身类型"] = eye_tracking_data["鞋身类型"].map(Type_shoe_mapping)
eye_tracking_data["鞋带类型"] = eye_tracking_data["鞋带类型"].map(Type_shoelace_mapping)
eye_tracking_data["性别"] = eye_tracking_data["性别"].map(Sex_mapping)
eye_tracking_data["专业"] = eye_tracking_data["专业"].map(Discipline_mapping)
print(eye_tracking_data.head())

The output (a screenshot in the original post) shows the mapped integer values.
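As the comment above notes, nominal features such as shoe type and lace type have no real order, so one-hot encoding suits them better than integer mapping. A minimal sketch with pandas' get_dummies (an illustration, not part of the original script):

# One-hot encode the nominal columns instead of forcing an artificial order on them
onehot = pd.get_dummies(eye_tracking_data, columns=['鞋身类型', '鞋带类型', '性别', '专业'])
print(onehot.columns)  # each category becomes its own 0/1 indicator column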
As the output above shows, the 鞋身平均瞳孔直径 (shoe-body mean pupil diameter) column contains '.' entries. These can be converted to NaN, and the missing values then either filled or dropped. The fillna function can fill with a mean, a median, a neighboring value, and so on via its value and method parameters.
# Convert the '.' noise values in the table to NaN
import numpy as np
eye_tracking_data["鞋身平均瞳孔直径"] = eye_tracking_data["鞋身平均瞳孔直径"].replace(r'\.+',np.nan,regex=True)
# Option 1: drop the incomplete rows outright
eye_tracking_data = eye_tracking_data.dropna()
#print(eye_tracking_data.head(50))
# Option 2 (alternative): fill the missing values instead
#eye_tracking_data["鞋身平均瞳孔直径"] = eye_tracking_data["鞋身平均瞳孔直径"].fillna(value=0)
(With dropna, row 473 is discarded.)

(With fillna, the value at row 473 that was originally '.' is filled with 0.)
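For completeness, here are a few of the fill strategies mentioned earlier, sketched on the same column (illustrations only; any one of them would replace the dropna call above):

# Alternative fills for the missing values (pick one)
col = eye_tracking_data["鞋身平均瞳孔直径"].astype(float)
eye_tracking_data["鞋身平均瞳孔直径"] = col.fillna(col.mean())   # fill with the column mean
# ... or col.fillna(col.median())                                # the median
# ... or col.ffill()                                             # the nearest preceding value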
Next, define the sample features and the target, and split the data into training and test sets.

# Use the first 14 columns as the feature matrix X
X = eye_tracking_data[['实验编号','ImageName','鞋身类型', '鞋带类型','性别','专业','注视点个数','平均注视点持续时间','鞋带平均瞳孔直径','鞋带注视时间','鞋带注视点个数','鞋身平均瞳孔直径','鞋身注视时间','鞋身注视点个数']]
# Use 主观评价 (subjective rating) as the target y; a 1-D Series is what the sklearn estimators below expect
y = eye_tracking_data['主观评价']
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
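By default train_test_split holds out 25% of the rows for testing; a quick check of the resulting shapes (illustrative only):

# With the default test_size=0.25, roughly a quarter of the samples land in the test set
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)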
Now put the features on a common scale: standardization and normalization.

# Standardization: rescale each feature column to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
print("standardized_X:",standardized_X,"/n","standardized_X_test:",standardized_X_test)
#归一化
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
print("normalized_X:",normalized_X,"/n","normalized_X_test:",normalized_X_test)
#二值化
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)

(See the appendix for the code of the remaining preprocessing operations.)
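One caveat worth noting: sklearn's Normalizer normalizes per sample, while "归一化" often refers to min-max scaling each feature into [0, 1]. If the latter is intended, MinMaxScaler is the matching tool (an alternative sketch, not in the original):

# Min-max scaling: map each feature column into the [0, 1] range,
# fitting on the training set only to avoid leaking test-set information
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler().fit(X_train)
minmax_X = minmax.transform(X_train)
minmax_X_test = minmax.transform(X_test)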
Logistic regression from Experiment 2
# Model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Logistic regression model
log_model = LogisticRegression()
# Train the model
log_model.fit(X_train, y_train)
# Predict y on the test set
y_pred = log_model.predict(X_test)
# Inspect the test results
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

Experiment 2
Because I am not yet familiar with the relationships among the features in the eye-tracking data, and because I have only just started learning scikit-learn, a second dataset is used below to practice visualization and the K-Means clustering algorithm.
Randomly generate some two-dimensional data as the training set; two-dimensional features are chosen so the results are easy to visualize.
import numpy as np
import matplotlib.pyplot as plt
# samples_generator was removed from newer scikit-learn; import make_blobs from sklearn.datasets instead
from sklearn.datasets import make_blobs
# X holds the sample features and y the cluster labels: 1000 samples with 2 features each,
# 4 clusters centered at [-1,-1], [0,0], [1,1], [2,2] with standard deviations [0.4, 0.2, 0.2, 0.2]
X, y = make_blobs(n_samples=1000, n_features=2, centers=[[-1,-1], [0,0], [1,1], [2,2]], cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=1)
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()

Now cluster the data with K-Means (the mini-batch variant), and use matplotlib to plot the output for k = 2, 3, 4, 5.
from sklearn import metrics
from sklearn.cluster import MiniBatchKMeans

# Fit MiniBatchKMeans for k = 2..5 and plot each result in a 2x2 grid,
# annotating every panel with its Calinski-Harabasz score
# (the misspelled calinski_harabaz_score was renamed calinski_harabasz_score in scikit-learn 0.23)
for i, k in enumerate([2, 3, 4, 5]):
    plt.subplot(2, 2, i + 1)
    y_pred = MiniBatchKMeans(n_clusters=k, batch_size=200, random_state=1).fit_predict(X)
    score = metrics.calinski_harabasz_score(X, y_pred)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.text(.99, .01, 'k=%d, score: %.2f' % (k, score),
             transform=plt.gca().transAxes, size=10,
             horizontalalignment='right')
plt.show()
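Besides the Calinski-Harabasz index, the silhouette coefficient is another common way to compare choices of k (an extra illustration, not in the original):

# Silhouette score: closer to 1 means tighter, better-separated clusters
from sklearn.metrics import silhouette_score
for k in [2, 3, 4, 5]:
    labels = MiniBatchKMeans(n_clusters=k, batch_size=200, random_state=1).fit_predict(X)
    print("k=%d, silhouette=%.3f" % (k, silhouette_score(X, labels)))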
Logistic regression

The code here is identical to the logistic-regression block at the end of Experiment 1 above (model, fit, predict, confusion matrix, classification report), so it is not repeated.
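One refinement worth trying (an assumption on my part, not from the original): fit the model on the standardized features from Experiment 1 and raise max_iter, since newer scikit-learn defaults to the lbfgs solver, which converges more reliably on scaled data:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit on the standardized training features instead of the raw ones;
# max_iter is raised because lbfgs may not converge within the default 100 iterations
log_model = LogisticRegression(max_iter=1000)
log_model.fit(standardized_X, y_train)
y_pred = log_model.predict(standardized_X_test)
print(metrics.classification_report(y_test, y_pred))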
Appendix (Code)
Appendix 1
#encoding-utf-8
Appendix 2
import numpy as np
Course Design
Course design requirements
Purpose
Project description
This project analyzes a dataset of Boston Celtics games for the 2010-2018 NBA seasons, collected from the web, and uses it to train a logistic regression model that predicts game outcomes. Anaconda, sklearn, and related libraries are combined with various classification algorithms to consolidate what this course covered, and the program is wrapped in a PyQt5 GUI.
The dataset is shown in the figure:
Features
Data cleaning
Feature value conversion
Data preprocessing
Algorithms
PyQt application interface
See the appendix at the end for the complete code and comments covering the program logic; a rough sketch of the pipeline follows below.
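As an outline of the pipeline these steps describe (the file name and column names here are hypothetical placeholders; the actual code is in the appendix):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical illustration of the course-design pipeline: clean, split, fit, evaluate
games = pd.read_csv('celtics_2010_2018.csv')                           # hypothetical file name
games = games.dropna()                                                 # data cleaning
X = games[['field_goal_pct', 'rebounds', 'assists', 'turnovers']]      # hypothetical feature columns
y = games['win']                                                       # hypothetical 0/1 outcome label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))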
Analysis of results
Effect of the data processing
Effect of the training-set split
Results of the classification algorithms
Program listing and source code
Complete implementation code
# -*- coding: utf-8 -*-