Review01 #15 · Merged
merged 3 commits · Sep 30, 2018
Changes from 1 commit
1. update Minkowski distance understanding; 2. update data file
SmirkCao committed Sep 29, 2018
commit 2f1eca829cfaf5bd0681d591f905eca87eebf500
3 changes: 3 additions & 0 deletions CH03/Input/data_3-1.txt
@@ -0,0 +1,3 @@
1 1
5 1
4 4
6 changes: 6 additions & 0 deletions CH03/Input/data_3-2.txt
@@ -0,0 +1,6 @@
2 3
5 4
9 6
4 7
8 1
7 2
43 changes: 37 additions & 6 deletions CH03/README.md
@@ -18,10 +18,13 @@

### Reading Guide



kNN is a basic method for classification and regression.

- A kd-tree is a tree structure for storing points in k-dimensional space
- Spatial-indexing methods are also widely used in point-cloud processing; kd-trees and octrees are common choices for organizing 3D point-cloud data
- The k in kNN and the k in kd-tree have different meanings

## k-Nearest Neighbor Algorithm
The case k=1 is called the nearest-neighbor algorithm. The book's later analysis uses nearest neighbor as the running example; this avoids deciding among classes and lets some details be skipped.

@@ -31,7 +34,7 @@ The case k=1 is called the nearest-neighbor algorithm

### Distance Metric

The $L_p$ distance is used here; see the Wikipedia entry for $L_p$ Space (Refs[1])
The $L_p$ distance is used here; see the Wikipedia entry for $L_p$ Space[^1]

1. p=1 corresponds to the Manhattan distance
1. p=2 corresponds to the Euclidean distance
@@ -49,6 +52,14 @@ $$L_p(x_i, x_j)=\left(\sum_{l=1}^{n}{\left|x_{i}^{(l)}-x_{j}^{(l)}\right|^p}\right)^{\frac{1}{p}}$$
1. Put differently, every point vector ($x_1$, $x_2$) on the curves in the figure has p-norm equal to 1
1. The figure contains several curves, and there is no symmetry about p=1

One point worth adding here:

A norm measures a vector or a matrix and is a scalar; here, the $L_p$ distance between two points can be viewed as the p-norm of the difference of the two points' coordinates.
Owner Author commented: This could be explored further.


Refer to the test case for Example 3.1; it does not actually use anything from the model itself.
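A minimal NumPy check of this point (the helper name `lp_distance` is illustrative, not from the repo; the printed values match the assertions in unit_test.py below):

```python
import numpy as np

def lp_distance(x1, x2, p=2):
    """Lp distance as the p-norm of the coordinate difference."""
    return np.linalg.norm(np.asarray(x1) - np.asarray(x2), ord=p)

# Example 3.1: x1 = (1, 1), x3 = (4, 4)
x1, x3 = [1, 1], [4, 4]
for p in range(1, 5):
    print(p, round(lp_distance(x1, x3, p), 2))
# 1 6.0 / 2 4.24 / 3 3.78 / 4 3.57 -- the nearest neighbor of x1 changes with p
```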




### Choosing k
1. On how the size of k affects predictions, the reference the book gives is ESL; ESL also has an introductory companion book, ISL.
@@ -65,17 +76,37 @@ $\frac{1}{k}\sum_{x_i\in N_k(x)}{I(y_i\ne c_i)}=1-\frac{1}{k}\sum_{x_i\in N_k(x)}{I(y_i=c_i)}$

If the classification loss is the 0-1 loss, then minimizing the misclassification rate is exactly minimizing the empirical risk.

For empirical risk, see (1.11) and (1.16) in Chapter 1 of the book
For empirical risk, see (1.11) and (1.16) in Chapter 1, [CH01](../CH01/README.md)
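For reference, (1.11) is the empirical risk definition; with the 0-1 loss restricted to the neighborhood $N_k(x)$, the majority-vote rule is exactly what minimizes it:

$$R_{emp}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_i, f(x_i))$$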

## Implementation

### Building the kd-Tree
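This section is still empty in this commit; as a sketch, the classic median-split construction (it mirrors the `Node` namedtuple from knn.py; `build_kdtree` is an illustrative name, the repo's version lives in `KNN._fit`):

```python
from collections import namedtuple

Node = namedtuple('Node', 'location left_child right_child')

def build_kdtree(point_list, depth=0):
    """Cycle through the axes and split at the median point."""
    if not point_list:
        return None
    axis = depth % len(point_list[0])   # x, y, x, y, ... for 2-D data
    point_list = sorted(point_list, key=lambda point: point[axis])
    median = len(point_list) // 2       # the median point becomes this node
    return Node(location=point_list[median],
                left_child=build_kdtree(point_list[:median], depth + 1),
                right_child=build_kdtree(point_list[median + 1:], depth + 1))
```

On the Example 3.2 data this yields [7, 2] as the root, matching the tuple asserted in test_e32.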



### Searching the kd-Tree
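Also a stub in this commit (`_search` in knn.py only indexes the root node). A sketch of the standard backtracking search, assuming the `Node` structure above:

```python
def nearest_neighbor(tree, target, depth=0, best=None):
    """Descend toward the target, then backtrack: visit the far branch
    only if the splitting plane is closer than the current best distance."""
    if tree is None:
        return best
    axis = depth % len(target)
    dist = sum((a - b) ** 2 for a, b in zip(tree.location, target)) ** 0.5
    if best is None or dist < best[1]:
        best = (tree.location, dist)   # best is a (point, distance) pair
    near, far = ((tree.left_child, tree.right_child)
                 if target[axis] < tree.location[axis]
                 else (tree.right_child, tree.left_child))
    best = nearest_neighbor(near, target, depth + 1, best)
    if abs(target[axis] - tree.location[axis]) < best[1]:
        best = nearest_neighbor(far, target, depth + 1, best)
    return best
```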

## Examples

### Example 3.1

Analyze how the value of p affects which point is the nearest neighbor. One thing to keep in mind about the Minkowski distance:

- it is the p-norm of the difference of the two points' coordinates

See the implementation of the corresponding test case for details.

### Example 3.2

kd-tree construction

### Example 3.3

kd-tree search
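Putting the two sketches above together on the Example 3.2 data (the query point is chosen for illustration and is not part of this commit):

```python
points = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]  # data_3-2.txt
tree = build_kdtree(points)
print(tree.location)                      # [7, 2], the root asserted in test_e32
print(nearest_neighbor(tree, [3, 4.5]))   # ([2, 3], 1.802...)
```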



## Refs
## References

1. [Lp Space](https://en.wikipedia.org/wiki/Lp_space)
[^1]: [Lp Space](https://en.wikipedia.org/wiki/Lp_space)
2. ESL
20 changes: 14 additions & 6 deletions CH03/knn.py
@@ -9,6 +9,7 @@
from collections import namedtuple
from operator import itemgetter
from pprint import pformat
import numpy as np


class Node(namedtuple('Node', 'location left_child right_child')):
@@ -17,13 +18,20 @@ def __repr__(self):


class KNN(object):
    def __init__(self, k=3, p=2):
    def __init__(self,
                 k=3,
                 p=2):
        """
        :param k: number of nearest neighbors to use
        :param p: order of the Lp distance
        """
        self.k = k
        self.p = p
        self.kdtree = None

    def lp_distance(self):
        # stub in this commit: prints k and returns a constant
        print(self.k)

        return 1

    @staticmethod
@@ -49,14 +57,14 @@ def _fit(point_list, depth=0):
    def _search(self, point):
        # stub in this commit: only touches the root node
        self.kdtree[0]

    def fit(self, x_):
        self.kdtree = KNN._fit(x_)
    def fit(self, X):
        self.kdtree = KNN._fit(X)
        return self.kdtree

    def predict(self, x_):
    def predict(self, X):
        return [[2]]

    def predict_proba(self, x_):
    def predict_proba(self, X):
        pass


34 changes: 22 additions & 12 deletions CH03/unit_test.py
@@ -3,28 +3,32 @@
# Filename: unit_test
# Date: 8/15/18
# Author: 😏 <smirk dot cao at gmail dot com>
from knn import *
import numpy as np
import argparse
import logging
import unittest
import knn


class TestStringMethods(unittest.TestCase):

    def test_e31(self):
        x = [[1, 1], [5, 1], [4, 4]]
        y = [1, 2, 3]
        rst = []
        for p in range(1, 5):
            clf_knn = knn.KNN(k=1, p=p)
            clf_knn.fit(x[1:])
            rst.extend(clf_knn.predict([x[0]]))

        self.assertEqual(rst, [[2], [2], [2], [2]])
        X = np.loadtxt("Input/data_3-1.txt")
        # print(X - X[0])
        rst = np.linalg.norm(X - X[0], ord=1, axis=1)
        for p in range(2, 5):
            rst = np.vstack((rst, np.linalg.norm(X - X[0], ord=p, axis=1)))
        # Lp(x1, x2)
        self.assertListEqual(np.round(rst[:, 1], 2).tolist(), [4] * 4)
        # Lp(x1, x3)
        self.assertListEqual(np.round(rst[:, 2], 2).tolist(), [6, 4.24, 3.78, 3.57])
        # print(np.round(rst[:, 2], 2).tolist())

    def test_e32(self):
        x = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
        y = [1, 2, 3, 4, 5, 6]
        clf_knn = knn.KNN(k=1, p=2)
        clf_knn.fit(x_=x)
        clf_knn = KNN(k=1, p=2)
        clf_knn.fit(x)
        self.assertEqual(clf_knn.kdtree, ([7, 2],
                                          ([5, 4],
                                           ([2, 3], None, None),
@@ -41,4 +45,10 @@ def test_e33(self):


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(__name__)

    ap = argparse.ArgumentParser()
    ap.add_argument("-p", "--path", required=False, help="path to input data file")
    args = vars(ap.parse_args())
    # note: unittest.main() re-parses sys.argv, so actually passing -p would make
    # it exit; the flag is unused in this commit
    unittest.main()
11 changes: 10 additions & 1 deletion index.md
@@ -46,4 +46,13 @@ $P_{26}, P_{102}$ also appears in support vector machines

### Inner Product

$P_{25}, P_{78}, P_{117}$ used in the perceptron, logistic regression, and support vector machines
$P_{25}, P_{78}, P_{117}$ used in the perceptron, logistic regression, and support vector machines

### Indicator Function

$P_{40}, P_{}$

### $L_p$ Distance

$P_{38}$