使用SVM对鸢尾花分类

百度AI Studio中的一个入门项目，增加了自己在实践时的一些注释，对小白来说阅读更顺畅。源码和数据在github上。

任务描述：

构建一个模型，根据鸢尾花的花萼和花瓣大小将其分为三种不同的品种。

数据集

总共包含150行数据

每一行数据由 4 个特征值及一个目标值组成。

4 个特征值分别为：萼片长度、萼片宽度、花瓣长度、花瓣宽度

目标值为三种不同类别的鸢尾花，分别为： Iris Setosa、Iris Versicolour、Iris Virginica

首先导入必要的包：

numpy：python第三方库，用于科学计算

matplotlib:python第三方库，主要用于进行可视化

sklearn:python的重要机器学习库，其中封装了大量的机器学习算法，如：分类、回归、降维以及聚类

import numpy as np                
from matplotlib import colors     
from sklearn import svm            
from sklearn.svm import SVC
from sklearn import model_selection
import matplotlib.pyplot as plt
import matplotlib as mpl

Step1.数据准备

(1)从指定路径下加载数据

(2)对加载的数据进行数据分割，x_train,x_test,y_train,y_test分别表示训练集特征、训练集标签、测试集特征、测试集标签

#*************将字符串转为整型，便于数据加载***********************
#在函数中建立一个对应字典就可以了，输入字符串，输出字符串对应的数字。
def iris_type(s):
#     print(type(s))
#字符串加个b是指btypes 字节串类型
    it = {b'Iris-setosa':0, b'Iris-versicolor':1, b'Iris-virginica':2}
    return it[s]

#加载数据
data_path='./iris.data'          #数据文件的路径
data = np.loadtxt(data_path,                                #数据文件路径
                  dtype=float,                              #数据类型
                  delimiter=',',                            #数据分隔符
                  converters={4:iris_type})                 #将第5列使用函数iris_type进行转换
# print(data)                                                 #data为二维数组，data.shape=(150, 5)
# print(data.shape)
#数据分割
x, y = np.split(data,                                       #要切分的数组
                (4,),                                       #沿轴切分的位置，第5列开始往后为y
                axis=1)                                     #1代表纵向分割，按列分割

x = x[:, 0:2] 
#第一个逗号之前表示行，只有冒号表示所有行，第二个冒号0:2表是0,1两列
#在X中我们取前两列作为特征，为了后面的可视化，原始的四维不好画图。x[:,0:4]代表第一维(行)全取，第二维(列)取0~2
# print(x)
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,              #所要划分的样本特征集
                                                               y,              #所要划分的样本结果
                                                               random_state=1, #随机数种子确保产生的随机数组相同
                                                               test_size=0.3)  #测试样本占比

random_state=1确保了每次运行程序时用的随机数都是一样的，也就是每次重新运行后所划分的训练集和测试集的样本都是一致的，相当于只在第一次运行的时候进行随机划分。如果不设置的话，每次重新运行的种子不一样，产生的随机数也不一样就会导致每次随机生成的训练集和测试集不一致。

Step2.模型搭建

C越大，相当于惩罚松弛变量，希望松弛变量接近0，即对误分类的惩罚增大，趋向于对训练集全分对的情况，这样对训练集测试时准确率很高，但泛化能力弱。
C值小，对误分类的惩罚减小，允许容错，将他们当成噪声点，泛化能力较强。

kernel=’linear’时，为线性核

decision_function_shape=’ovr’时，为one v rest，即一个类别与其他类别进行划分，

decision_function_shape=’ovo’时，为one v one，即将类别两两之间进行划分，用二分类的方法模拟多分类的结果。
ovr是多类情况1和ovo是多类情况2，可以在我个人博客-线性判别函数上查看详细说明。

#**********************SVM分类器构建*************************
def classifier():
    #clf = svm.SVC(C=0.8,kernel='rbf', gamma=50,decision_function_shape='ovr')
    clf = svm.SVC(C=0.5,                         #误差项惩罚系数,默认值是1
                  kernel='linear',               #线性核 kenrel="rbf":高斯核
                  decision_function_shape='ovr') #决策函数
    return clf

1 2	# 2.定义模型：SVM模型定义 clf = classifier()

Step3.模型训练

1	y_train.ravel()#ravel()扁平化，将原来的二维数组转换为一维数组

array([2., 0., 0., 0., 1., 0., 0., 2., 2., 2., 2., 2., 1., 2., 1., 0., 2.,
       2., 0., 0., 2., 0., 2., 2., 1., 1., 2., 2., 0., 1., 1., 2., 1., 2.,
       1., 0., 0., 0., 2., 0., 1., 2., 2., 0., 0., 1., 0., 2., 1., 2., 2.,
       1., 2., 2., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0., 2., 2., 2., 0.,
       0., 1., 0., 2., 0., 2., 2., 0., 2., 0., 1., 0., 1., 1., 0., 0., 1.,
       0., 1., 1., 0., 1., 1., 1., 1., 2., 0., 0., 2., 1., 2., 1., 2., 2.,
       1., 2., 0.])

#***********************训练模型*****************************
def train(clf,x_train,y_train):
    clf.fit(x_train,         #训练集特征向量，fit表示输入数据开始拟合
            y_train.ravel()) #训练集目标值 ravel()扁平化，将原来的二维数组转换为一维数组

1 2	# 3.训练SVM模型 train(clf,x_train,y_train)

Step4.模型评估

#**************并判断a b是否相等，计算acc的均值*************
def show_accuracy(a, b, tip):
    acc = a.ravel() == b.ravel()
    print('%s Accuracy:%.3f' %(tip, np.mean(acc)))

def print_accuracy(clf,x_train,y_train,x_test,y_test):
    #分别打印训练集和测试集的准确率  score(x_train,y_train):表示输出x_train,y_train在模型上的准确率
    print('trianing prediction:%.3f' %(clf.score(x_train, y_train)))
    print('test data prediction:%.3f' %(clf.score(x_test, y_test)))
    #原始结果与预测结果进行对比   predict()表示对x_train样本进行预测，返回样本类别
    show_accuracy(clf.predict(x_train), y_train, 'traing data')
    show_accuracy(clf.predict(x_test), y_test, 'testing data')
    #计算决策函数的值，表示x到各分割平面的距离,3类，所以有3个决策函数，不同的多类情况有不同的决策函数？
    print('decision_function:\n', clf.decision_function(x_train))

1 2	# 4.模型评估 print_accuracy(clf,x_train,y_train,x_test,y_test)

trianing prediction:0.819
test data prediction:0.778
traing data Accuracy:0.819
testing data Accuracy:0.778
decision_function:
 [[-0.5         1.20887337  2.29112663]
 [ 2.06328814 -0.0769677   1.01367956]
 [ 2.16674973  0.91702835 -0.08377808]
 [ 2.11427813  0.99765248 -0.11193061]
 [ 0.9925538   2.06392138 -0.05647518]
 [ 2.11742969  0.95255534 -0.06998503]
 [ 2.05615004 -0.041847    0.98569697]
 [-0.31866596  1.02685964  2.29180632]
 [-0.27166251  1.09150338  2.18015913]
 [-0.37827567  1.14260447  2.2356712 ]
 [-0.22150749  1.11104997  2.11045752]
 [-0.18331208  2.10066724  1.08264485]
 [-0.05444966  0.99927764  2.05517201]
 [-0.46977766  1.17853774  2.29123992]
 [-0.05760122  2.04437478  1.01322644]
 [ 2.1747228   0.93698124 -0.11170404]
 [-0.13315707  2.12021384  1.01294323]
 [-0.21752096  2.12102642  1.09649454]
 [ 2.11427813  0.99765248 -0.11193061]
 [ 2.16359817  0.96212549 -0.12572366]
 [-0.21038286  1.08590572  2.12447714]
 [ 2.21291822  0.9265985  -0.13951672]
 [-0.13399204  1.06514025  2.06885179]
 [-0.18016052  1.0555701   2.12459042]
 [-0.2334671   1.08112064  2.15234646]
 [-0.08782356  2.0747104   1.01311315]
 [-0.20324476  1.05078502  2.15245974]
 [-0.11489433  1.05994888  2.05494545]
 [ 2.17787437 -0.1081159   0.93024154]
 [-0.23578369  2.18129137  1.05449232]
 [-0.20639632  1.09588216  2.11051416]
 [-0.21038286  1.08590572  2.12447714]
 [-0.02969547  2.11420989  0.91548558]
 [-0.12685394  1.03001955  2.09683439]
 [-0.09496166  2.1098311   0.98513056]
 [ 2.10547008 -0.07737399  0.97190391]
 [ 2.11029159  0.98767604 -0.09796763]
 [ 2.20411017 -0.14842797  0.9443178 ]
 [-0.20324476  1.05078502  2.15245974]
 [ 2.19066895  0.97688701 -0.16755596]
 [-0.16022784  2.10545232  1.05477553]
 [-0.23661866  1.12621778  2.11040088]
 [-0.09579663  2.05475752  1.04103911]
 [ 2.11344315 -0.05742111  0.94397795]
 [ 2.10231852  0.96772315 -0.07004167]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.11029159  0.98767604 -0.09796763]
 [-0.41248455  1.16296364  2.2495209 ]
 [-0.16820091  1.08549943  2.08270149]
 [-0.42045762  1.14301076  2.27744686]
 [-0.24857827  1.09628845  2.15228982]
 [-0.27796564  2.18169766  1.09626798]
 [-0.09264507  1.00966038  2.08298469]
 [-0.25339978  1.03123843  2.22216135]
 [-0.05361468  2.05435123  0.99926346]
 [ 2.15395516 -0.16797456  1.01401941]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.06579305  1.08825305 -0.15404611]
 [-0.11007283  2.12499891  0.98507392]
 [-0.27166251  1.09150338  2.18015913]
 [ 2.13652739  0.94736397 -0.08389137]
 [-0.29789831  1.13181544  2.16608287]
 [ 2.15163856  0.93219616 -0.08383473]
 [ 2.1747228   0.93698124 -0.11170404]
 [-0.11174277  1.01485174  2.09689103]
 [-0.06872585  2.06951904  0.99920682]
 [-0.23745364  1.0711442   2.16630944]
 [ 2.12141623  0.96253178 -0.08394801]
 [ 2.1627632  -0.09294809  0.93018489]
 [-0.06557429  1.0244219   2.04115239]
 [ 2.16758471  0.97210193 -0.13968664]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.1293893   0.98248467 -0.11187396]
 [-0.21038286  1.08590572  2.12447714]
 [ 2.01962457  1.0786829  -0.09830747]
 [ 2.18269588  0.95693412 -0.13963   ]
 [-0.16106282  1.05037873  2.11068408]
 [ 2.20976665  0.97169564 -0.1814623 ]
 [-0.03850351  2.03918342  0.9993201 ]
 [ 2.17555778  0.99205482 -0.1676126 ]
 [-0.11007283  2.12499891  0.98507392]
 [-0.07502898  2.15971332  0.91531566]
 [ 2.13254086  0.93738753 -0.06992839]
 [ 2.09518042  1.00284385 -0.09802427]
 [ 1.0045134   2.09385071 -0.09836411]
 [ 2.24314055  0.89626288 -0.13940344]
 [-0.09579663  2.05475752  1.04103911]
 [-0.14910321  1.08030806  2.06879515]
 [ 2.13652739  0.94736397 -0.08389137]
 [-0.2334671   1.08112064  2.15234646]
 [-0.07271239  2.05954259  1.0131698 ]
 [-0.2739791   2.1916741   1.082305  ]
 [-0.27564905  1.08152693  2.19412211]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.06013657 -0.03187056  0.97173399]
 [ 2.07608272  1.00803521 -0.08411793]
 [-0.19443672  2.12581149  1.06862523]
 [-0.16421438  2.09547587  1.06873851]
 [-0.3440668   1.12224529  2.22182151]
 [-0.1180459   2.10504603  1.01299987]
 [-0.20240979  1.10585861  2.09655118]
 [-0.17617399  1.06554654  2.11062744]
 [-0.2477433   2.15136204  1.09638126]
 [-0.2334671   1.08112064  2.15234646]
 [ 2.11029159  0.98767604 -0.09796763]]

Step5.模型使用

np.mgrid的作用是用前两个特征生成其对应最大最小范围所能组合出的所有200*200的样本，也就是遍历了这两个特征所能组合出的所有可能性，只是粒度是1/200

def draw(clf, x):
    iris_feature = 'sepal length', 'sepal width', 'petal lenght', 'petal width'
    # 开始画图
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()               #第0列的范围
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()               #第1列的范围
    x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]   #生成网格采样点 开始坐标：结束坐标（不包括）：步长
    #flat将二维数组转换成1个1维的迭代器，然后把x1和x2的所有可能值给匹配成为样本点
    grid_test = np.stack((x1.flat, x2.flat), axis=1)            #stack():沿着新的轴加入一系列数组，竖着（按列）增加两个数组，grid_test的shape：(40000, 2)
    print('grid_test:\n', grid_test)
    # 输出样本到决策面的距离
    z = clf.decision_function(grid_test)
    print('the distance to decision plane:\n', z)
    
    grid_hat = clf.predict(grid_test)                           # 预测分类值 得到【0,0.。。。2,2,2】
    print('grid_hat:\n', grid_hat)  
    grid_hat = grid_hat.reshape(x1.shape)                       # reshape grid_hat和x1形状一致
                                                                #若3*3矩阵e，则e.shape()为3*3,表示3行3列   
 	#light是网格测试点的配色，相当于背景
    #dark是样本点的配色
    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'b', 'r'])
     #画出所有网格样本点被判断为的分类，作为背景
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)                                   # pcolormesh(x,y,z,cmap)这里参数代入
                                                                                      # x1，x2，grid_hat，cmap=cm_light绘制的是背景。
    #squeeze()把y的个数为1的维度去掉，也就是变成一维。
    plt.scatter(x[:, 0], x[:, 1], c=np.squeeze(y), edgecolor='k', s=50, cmap=cm_dark) # 样本点
    plt.scatter(x_test[:, 0], x_test[:, 1], s=200, facecolor='yellow', zorder=10, marker='+')       # 测试点
    plt.xlabel(iris_feature[0], fontsize=20)
    plt.ylabel(iris_feature[1], fontsize=20)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('svm in iris data classification', fontsize=30)
    plt.grid()
    plt.show()

1 2	# 5.模型使用 draw(clf,x)

grid_test:
 [[4.3       2.       ]
 [4.3       2.0120603]
 [4.3       2.0241206]
 ...
 [7.9       4.3758794]
 [7.9       4.3879397]
 [7.9       4.4      ]]
the distance to decision plane:
 [[ 2.04663576  1.0980928  -0.14472856]
 [ 2.04808477  1.09663836 -0.14472313]
 [ 2.04953377  1.09518392 -0.1447177 ]
 ...
 [-0.21454554  0.96016146  2.25438408]
 [-0.21309653  0.95870702  2.25438951]
 [-0.21164753  0.95725258  2.25439495]]
grid_hat:
 [0. 0. 0. ... 2. 2. 2.]

png

参考

项目来自课程2-机器学习入门实践-鸢尾花分类