Project Address:
https://github.com/TheOneAC/ML.git
The dataset is in ML/ML_ation/knn
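k-Nearest Neighbors (kNN) algorithm
- Pros: high accuracy, insensitive to outliers, no assumptions about the input data
- Cons: high computational complexity, high space complexity
- Applicable data: numeric and nominal values
- Take the k most similar data points and assign the new data point the class that occurs most often among them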
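kNN pseudocode
- Compute the distance between the current point and every point whose class is known
- Sort by increasing distance and take the k nearest points
- Count how often each class occurs among these k points
- Return the most frequent class as the predicted class of the current point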
python code
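The four pseudocode steps above can be sketched as follows. This is a minimal sketch, not the code from the repository: the name classify0 and its argument order are taken from the call that appears later in this post, while the function body, the choice of Euclidean distance, and the toy data are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def classify0(in_x, data_set, labels, k):
    """Return the majority label among the k training points nearest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    in_x = np.asarray(in_x, dtype=float)
    # Step 1: Euclidean distance (an assumption) from in_x to every labeled point.
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = distances.argsort()[:k]
    # Step 3: count how often each label occurs among those k points.
    votes = Counter(labels[i] for i in nearest)
    # Step 4: the most frequent label is the prediction.
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration only.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

result = classify0([0.0, 0.2], group, group_labels, 3)
print(result)
```

With k = 3 the query point [0.0, 0.2] has two 'B' neighbors and one 'A' neighbor, so the vote goes to 'B'.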
Draw a scatter plot from the second and third columns of the data
./test.py
Modify the scatter call to add color (the old call is left commented out):
    # ax.scatter(datingDatMat[:,1], datingDatMat[:,2])
    ax.scatter(datingDatMat[:,1], datingDatMat[:,2], 15.0 * array(datinglabels), 15.0 * array(datinglabels))
Change the plotted axes to use the first and second columns of the data:
    # ax.scatter(datingDatMat[:,1], datingDatMat[:,2], 15.0 * array(datinglabels), 15.0 * array(datinglabels))
    ax.scatter(datingDatMat[:,0], datingDatMat[:,1], 15.0 * array(datinglabels), 15.0 * array(datinglabels))
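In the scatter calls above, the third and fourth positional arguments of matplotlib's scatter are s (marker size) and c (color values), so each point is sized and colored by its class label. The sketch below only shows what those two arrays contain. The label values are made up; the real ones come from datinglabels.

```python
import numpy as np

# Hypothetical class labels for five samples (stand-in for datinglabels).
labels = [1, 3, 2, 1, 3]

# Same expression as in the scatter call: one number per point.
scaled = 15.0 * np.array(labels)

sizes = scaled   # would be passed as s: larger label, larger marker
colors = scaled  # would be passed as c: mapped to colors through the colormap
print(scaled)
```

Points of the same class get the same size and the same color, which is what makes the three classes distinguishable in the plot.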
Numeric normalization
newValue = (oldValue - min)/(max - min)
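As a small worked example of the formula above, the sketch below rescales each column of a made-up matrix to the range [0, 1]. It relies on NumPy broadcasting rather than tile, and the numbers are invented for illustration. It also assumes that no column is constant, since a zero range would cause a division by zero.

```python
import numpy as np

# Made-up feature matrix: rows are samples, columns are features on very different scales.
data = np.array([[40000.0, 8.0, 0.9],
                 [14000.0, 7.0, 1.6],
                 [26000.0, 1.0, 0.8]])

min_vals = data.min(axis=0)    # per-column minimum
max_vals = data.max(axis=0)    # per-column maximum
ranges = max_vals - min_vals   # per-column (max - min)

# newValue = (oldValue - min) / (max - min), applied column by column.
norm_data = (data - min_vals) / ranges
print(norm_data)
```

After rescaling, the smallest value in every column becomes 0 and the largest becomes 1, so no single feature dominates the distance computation just because of its units.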
    def autoNorm(dataSet):
        minVals = dataSet.min(0)    # per-column minimum
        maxVals = dataSet.max(0)    # per-column maximum
        ranges = maxVals - minVals
        normDataSet = zeros(shape(dataSet))
        m = dataSet.shape[0]
        normDataSet = dataSet - tile(minVals, (m,1))
        normDataSet = normDataSet/tile(ranges, (m,1))
        return normDataSet, ranges, minVals

    def datingClassTest():
        hoRatio = 0.05    # share of the rows held out for testing
        datingDatMat, datinglabels = file2matrix('datingTestSet2.txt')
        normSet, ranges, minVals = autoNorm(datingDatMat)
        m = normSet.shape[0]
        numTestVecs = int(m * hoRatio)
        errorCount = 0.0
        for i in range(numTestVecs):
            # the first numTestVecs rows are test data, the remaining rows are training data
            classifierResult = classify0(normSet[i,:], normSet[numTestVecs:m,:], datinglabels[numTestVecs:m], 3)
            print("the classifier came back with: %d , the real answer is: %d" % (classifierResult, datinglabels[i]))
            if classifierResult != datinglabels[i]: errorCount += 1.0
        print("the total error rate is %f " % (errorCount/float(numTestVecs)))

    # called from outside the module
    knn.datingClassTest()
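The hold-out procedure used by datingClassTest can be tried without the data file. The following is a rough sketch under stated assumptions: knn_predict is a hypothetical stand-in for classify0, holdout_error is a hypothetical helper, and the data is synthetic (two well-separated clusters), not the dating data set. The shuffle step is also an addition; the code above takes the first rows as test data as they appear in the file.

```python
import numpy as np
from collections import Counter

def knn_predict(x, train_x, train_y, k):
    """Hypothetical stand-in for classify0: majority vote among the k nearest points."""
    dist = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    nearest = dist.argsort()[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

def holdout_error(data, labels, ho_ratio, k):
    """Use the first ho_ratio share of the rows as test data, the rest as training data."""
    m = data.shape[0]
    num_test = int(m * ho_ratio)
    errors = 0
    for i in range(num_test):
        predicted = knn_predict(data[i], data[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / float(num_test)

# Synthetic data: class 1 clustered near 0.2, class 2 clustered near 0.8.
rng = np.random.RandomState(0)
class1 = 0.2 + 0.05 * rng.randn(50, 2)
class2 = 0.8 + 0.05 * rng.randn(50, 2)
data = np.vstack([class1, class2])
labels = np.array([1] * 50 + [2] * 50)

# Shuffle so that the leading test rows contain both classes.
order = rng.permutation(len(labels))
data, labels = data[order], labels[order]

error_rate = holdout_error(data, labels, ho_ratio=0.2, k=3)
print("the total error rate is %f" % error_rate)
```

Because the test rows are simply the first rows of the matrix, this kind of split only gives a fair estimate when the rows are not ordered by class.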
Handwritten character recognition
./test.py
    #!/usr/bin/python
    from numpy import *
    import operator
    import knn    # needed for the knn.handwritingClassTest() call below

    knn.handwritingClassTest()