概述
标签传播算法(LPA, Label Propagation Algorithm)是一种基于标签传播的社区识别算法。每个节点初始化标签后,在算法的每次迭代中,每个节点都会将其标签更新为在其邻居中最普遍的标签。这个迭代过程会使连接密集的节点拥有同一标签,有相同标签的节点继而形成一个社区。
LPA不要求预先设定要划分的社区数量,也不将任何用于衡量社区划分优劣的标准作为目标,相反,它仅利用网络结构来引导其进程,这使得LPA能够有效地在大型和复杂的网络中运行。
算法的相关资料如下:
- U.N. Raghavan, R. Albert, S. Kumara, Near linear time algorithm to detect community structures in large-scale networks (2007)
基本概念
标签
指定的一个属性值或UUID对节点标签进行初始化。算法结束时,拥有相同标签的节点属于同一社区。
标签传播
在最简单的场景中,每次迭代传播时,每个节点都会将其标签更新为其最多邻居拥有的标签。
如下例所示,蓝色节点的标签将从d
更新为c
。
考虑节点和边权重时,标签权重则等于相应节点和边权重的乘积之和,每个节点将其标签更新为权重最大的那一个。
以下的示例中包含节点和边的权重,蓝色节点的标签将从d
更新为a
。
多标签传播
在多标签传播中,每个节点在传播过程中会保留多个标签。在这种情况下,节点的每个标签会关联一个与其权重成正比的标签概率,而每个节点的标签概率总和保持为1。
在下面的示例中,每个节点保留2个标签,其概率写在标签旁,蓝色节点的标签将从d, c
更新为a, c
,标签概率为Pa = 6.3/(6.3+1.85) = 0.77和Pc = 1.85/(6.3+1.85) = 0.23。
特殊说明
- LPA忽略边的方向,按照无向边进行计算。
- 具有自环边的节点能将自身当前的标签传播给自身,并且每条自环边计算两次。
- LPA在更新节点标签时遵循同步更新原则,即所有节点会根据其邻居的标签同时更新其标签。但在某些情况下可能会发生标签振荡,这多见于二分图中。为了解决这个问题,LPA包含一个中断机制,可以检测并防止标签过度振荡。
- 受节点顺序、同权重标签的随机选取以及并行计算等因素的影响,标签传播算法的社区划分结果可能会不同。
示例图集
创建示例图集:
// 在空图集中逐行运行以下语句
create().node_schema("user").edge_schema("connect")
create().node_property(@user,"interest",string).node_property(@user,"level",int32).edge_property(@connect,"strength",int32)
insert().into(@user).nodes([{_id:"A",interest:"flute",level:2}, {_id:"B",interest:"football",level:4}, {_id:"C",interest:"piano",level:4}, {_id:"D",interest:"violin",level:2}, {_id:"E",interest:"piano",level:4}, {_id:"F",interest:"movie",level:1}, {_id:"G",interest:"piano",level:4}, {_id:"H",interest:"tennis",level:2}, {_id:"I",interest:"violin",level:3}, {_id:"J",interest:"badminton",level:5}, {_id:"K",interest:"swimming",level:4}, {_id:"L",interest:"cello",level:1}, {_id:"M",interest:"saxophone",level:2}, {_id:"N",interest:"novel",level:3}, {_id:"O",interest:"swimming",level:3}])
insert().into(@connect).edges([{_from:"A",_to:"B",strength:3}, {_from:"A",_to:"C",strength:5}, {_from:"A",_to:"F",strength:8}, {_from:"A",_to:"K",strength:6}, {_from:"B",_to:"C",strength:2}, {_from:"C",_to:"D",strength:9}, {_from:"D",_to:"A",strength:5}, {_from:"D",_to:"E",strength:6}, {_from:"E",_to:"A",strength:5}, {_from:"F",_to:"G",strength:9}, {_from:"F",_to:"J",strength:4}, {_from:"G",_to:"H",strength:10}, {_from:"H",_to:"F",strength:3}, {_from:"I",_to:"H",strength:4}, {_from:"I",_to:"F",strength:2}, {_from:"J",_to:"I",strength:1}, {_from:"K",_to:"F",strength:1}, {_from:"K",_to:"N",strength:10}, {_from:"L",_to:"M",strength:1}, {_from:"L",_to:"N",strength:4}, {_from:"M",_to:"N",strength:8}, {_from:"M",_to:"K",strength:10}, {_from:"N",_to:"M",strength:4}, {_from:"O",_to:"N",strength:1}])
创建HDC图集
将当前图集全部加载到HDC服务器hdc-server-1
上,并命名为 hdc_lpa
:
CALL hdc.graph.create("hdc-server-1", "hdc_lpa", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
})
hdc.graph.create("hdc_lpa", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
}).to("hdc-server-1")
参数
算法名:lpa
参数名 |
类型 |
规范 |
默认值 |
可选 |
描述 |
---|---|---|---|---|---|
node_label_property |
"<@schema.?><property> " |
/ | / | 是 | 用于初始化节点标签的数值类型或字符串类型点属性;不包含指定属性的点将被忽略。若未设置,则由系统自动生成标签 |
node_weight_property |
"<@schema.?><property> " |
/ | / | 是 | 作为点权重的数值类型点属性 |
edge_weight_property |
"<@schema.?><property> " |
/ | / | 是 | 作为边权重的数值类型边属性 |
loop_num |
Integer | ≥1 | 5 |
是 | 传播迭代次数 |
k |
Integer | ≥1 | 1 |
是 | 指定计算结束时每个节点保留的最大标签数量,所有标签按概率降序排序 |
return_id_uuid |
String | uuid , id , both |
uuid |
是 | 在结果中使用_uuid 、_id 或同时使用两者来表示点 |
文件回写
CALL algo.lpa.write("hdc_lpa", {
params: {
return_id_uuid: "id",
k: 2,
loop_num: 5,
edge_weight_property: 'strength'
},
return_params: {
file: {
filename: "lpa"
}
}
})
algo(lpa).params({
projection: "hdc_lpa",
return_id_uuid: "id",
k: 2,
loop_num: 5,
edge_weight_property: 'strength'
}).write({
file: {
filename: "lpa"
}
})
数据库回写
将结果中的label_<N>
与对应的probability_<N>
值写入指定点属性。对应属性类型分别为string
和float
。
CALL algo.lpa.write("hdc_lpa", {
params: {
node_label_property: 'interest',
k: 2,
loop_num: 10
},
return_params: {
db: {
property: "lab"
}
}
})
algo(lpa).params({
projection: "hdc_lpa",
node_label_property: 'interest',
k: 2,
loop_num: 10
}).write({
db: {
property: "lab"
}
})
各点的标签和标签概率被写入新属性lab_1
,probability_1
,lab_2
和probability_2
中。
统计回写
CALL algo.lpa.write("hdc_lpa", {
params: {
node_label_property: 'interest',
k: 2,
loop_num: 10
},
return_params: {
stats: {}
}
})
algo(lpa).params({
projection: "hdc_lpa",
node_label_property: 'interest',
k: 2,
loop_num: 10
}).write({
stats:{}
})
结果:
label_count |
---|
6 |
完整返回
CALL algo.lpa("hdc_lpa", {
params: {
return_id_uuid: "id",
node_label_property: "@user.interest",
k: 2
},
return_params: {}
}) YIELD r
RETURN r
exec{
algo(lpa).params({
return_id_uuid: "id",
node_label_property: "@user.interest",
k: 2
}) as r
return r
} on hdc_lpa
结果:
_id | label_1 | probability_1 | label_2 | probability_2 |
---|---|---|---|---|
I | badminton | 0.517124 | movie | 0.482876 |
G | movie | 0.563411 | badminton | 0.436589 |
J | movie | 0.605133 | badminton | 0.394867 |
D | piano | 0.701716 | flute | 0.298284 |
N | swimming | 0.675096 | saxophone | 0.324904 |
F | badminton | 0.564691 | movie | 0.435309 |
H | movie | 0.535167 | badminton | 0.464833 |
B | piano | 0.646695 | flute | 0.353305 |
L | novel | 0.510868 | swimming | 0.489132 |
A | piano | 0.736380 | flute | 0.263620 |
O | novel | 0.765123 | swimming | 0.234877 |
E | piano | 0.594943 | flute | 0.405057 |
K | novel | 0.510868 | swimming | 0.489132 |
M | novel | 0.515860 | swimming | 0.484140 |
C | piano | 0.640369 | flute | 0.359631 |
流式返回
CALL algo.lpa("hdc_lpa", {
params: {
return_id_uuid: "id",
node_label_property: "@user.interest",
node_weight_property: "@user.level",
edge_weight_property: "strength",
loop_num: 10
},
return_params: {
stream: {}
}
}) YIELD r
RETURN r.label_1 AS label, count(r) GROUP BY label
exec{
algo(lpa).params({
return_id_uuid: "id",
node_label_property: "@user.interest",
node_weight_property: "@user.level",
edge_weight_property: "strength",
loop_num: 10
}).stream() as r
group by r.label_1 as label
return table(label, count(r))
} on hdc_lpa
结果:
label | count(r) |
---|---|
violin | 3 |
tennis | 2 |
swimming | 3 |
novel | 2 |
piano | 5 |
统计返回
CALL algo.lpa("hdc_lpa", {
params: {
node_label_property: "interest",
edge_weight_property: "strength",
k: 1,
loop_num: 5
},
return_params: {
stats: {}
}
}) YIELD s
RETURN s
exec{
algo(lpa).params({
node_label_property: "interest",
edge_weight_property: "strength",
k: 1,
loop_num: 5
}).stats() as s
return s
} on hdc_lpa
结果:
label_count |
---|
5 |