TextRank - Graph Analytics & Algorithms

TextRank

HDC

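Overview

TextRank is a graph-based ranking model for text processing, derived from PageRank. It can be applied to a variety of natural language processing tasks, such as keyword extraction, keyphrase extraction, and text summarization.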

R. Mihalcea, P. Tarau, TextRank: Bringing Order Into Texts (2004)

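Concepts

Converting Text to a Graph

Before TextRank can be applied, the text must be represented as a graph. The structure of the graph depends on the application:

Nodes: the text units best suited to the task (characters, words, or sentences) are added to the graph as nodes.
Edges: relations between text units, such as semantic similarity, co-occurrence, or contextual overlap, become the edges that connect the nodes.

Sample graph built for keyword extraction: nodes are lexical units selected from the text, and edges are created from co-occurrence within a defined word window (source: original paper)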
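The co-occurrence construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `cooccurrence_edges`, the token list, and the choice to count repeated co-occurrences as an edge weight are assumptions made here, not part of the original page.

```python
from collections import Counter

def cooccurrence_edges(tokens, window=2):
    """Count undirected co-occurrences of distinct tokens within a sliding window.

    A window of N means two tokens are linked when they appear within
    N consecutive positions of the token sequence.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the token at position i with the tokens that follow it inside the window.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) map to the same undirected edge.
                counts[tuple(sorted((left, right)))] += 1
    return counts

# Hypothetical, already filtered token sequence.
tokens = ["linear", "constraints", "linear", "equations", "systems"]
edges = cooccurrence_edges(tokens, window=2)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Each key of `edges` is one undirected edge between two words, and the count can serve as its weight when the graph is loaded.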

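TextRank Model

The rank of every text unit is computed recursively through a "recommendation" mechanism, in the same way as in PageRank. TextRank, however, uses a modified formula that takes edge weights into account:

$$TR(u) = (1 - d) + d \sum_{v \in In(u)} \frac{w_{vu}}{\sum_{k \in Out(v)} w_{vk}} \, TR(v)$$

where

TR(u) is the rank of node u;
In(u) is the set of nodes that point to node u;
Out(v) is the set of nodes that node v points to;
w_vu is the weight of the edge between node v and node u;
d is the damping factor.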

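Considerations

The rank of an isolated text unit equals (1 - d).
A self-loop counts both as a successor and as a predecessor, so a node passes rank to itself through its self-loop. If the graph contains many self-loops, the algorithm needs more iterations to converge.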
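To make the model and the two notes above concrete, here is a minimal Python sketch of the weighted update from the original paper, in which each node receives (1 - d) plus d times the weight-proportional share of every predecessor's rank. It is not the engine's implementation. The function `text_rank`, its argument names, the synchronous update order, and the tiny graph are assumptions chosen for illustration, and all weights are assumed to be positive.

```python
def text_rank(nodes, edges, damping=0.8, init_value=0.2, loop_num=5):
    """Iterate the weighted rank update for a fixed number of rounds.

    edges is a list of (source, target, weight) tuples with positive weights.
    """
    # Total outgoing weight per node, used to split a node's rank among its successors.
    out_weight = {n: 0.0 for n in nodes}
    for src, _dst, w in edges:
        out_weight[src] += w

    rank = {n: init_value for n in nodes}
    for _ in range(loop_num):
        # Every node starts each round with the baseline (1 - d).
        new_rank = {n: 1.0 - damping for n in nodes}
        for src, dst, w in edges:
            new_rank[dst] += damping * (w / out_weight[src]) * rank[src]
        rank = new_rank
    return rank

# "z" is isolated; "s" has only a self-loop; "x" points to "y".
nodes = ["x", "y", "z", "s"]
edges = [("x", "y", 2.0), ("s", "s", 1.0)]

after_5 = text_rank(nodes, edges, loop_num=5)
after_50 = text_rank(nodes, edges, loop_num=50)
print(after_5)
print(after_50)
```

The isolated node `z` stays at (1 - d) in every round, while the self-loop node `s` keeps feeding its own rank back to itself and is still moving after 5 rounds, which illustrates why graphs with many self-loops need more iterations.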

Example Graph

To create the example graph:

// Run the following statements one by one in an empty graph
create().edge_property(@default, "weight", int32)
insert().into(@default).nodes([{_id:"A"}, {_id:"B"}, {_id:"C"}, {_id:"D"}, {_id:"E"}, {_id:"F"}, {_id:"G"}])
insert().into(@default).edges([{_from:"A", _to:"E", weight:3}, {_from:"B", _to:"A", weight:3}, {_from:"B", _to:"E", weight:2}, {_from:"C", _to:"A", weight:1}, {_from:"C", _to:"D", weight:4}, {_from:"D", _to:"E", weight:5}, {_from:"E", _to:"G", weight:2}, {_from:"F", _to:"B", weight:1}, {_from:"F", _to:"G", weight:3}])

Creating HDC Graph

Load the entire current graph to the HDC server hdc-server-1 and name it hdc_textrank:

CALL hdc.graph.create("hdc-server-1", "hdc_textrank", {
  nodes: {"*": ["*"]},
  edges: {"*": ["*"]},
  direction: "undirected",
  load_id: true,
  update: "static",
  query: "query",
  default: false
})

hdc.graph.create("hdc_textrank", {
  nodes: {"*": ["*"]},
  edges: {"*": ["*"]},
  direction: "undirected",
  load_id: true,
  update: "static",
  query: "query",
  default: false
}).to("hdc-server-1")

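Parameters

Algorithm name: text_rank

Name	Type	Spec	Default	Optional	Description
`init_value`	Float	>0	`0.2`	Yes	The initial rank of all nodes
`loop_num`	Integer	≥1	`5`	Yes	The maximum number of iteration rounds. The algorithm stops after all rounds are completed
`damping`	Float	(0,1)	`0.8`	Yes	The damping factor
`max_change`	Float	≥0	`0`	Yes	If, after an iteration, the rank change of every node is less than `max_change`, the result is considered stable and the algorithm stops. Setting it to `0` disables this criterion
`edge_schema_property`	[]"`<@schema.?><property>`"	/	/	No	Numeric edge properties to use as weights; the weight of an edge is the sum of the values of all specified properties. Edges without the specified properties are ignored
`return_id_uuid`	String	`uuid`, `id`, `both`	`uuid`	Yes	Whether to represent nodes in the results by `_uuid`, `_id`, or both
`limit`	Integer	≥-1	`-1`	Yes	The number of results to return; `-1` returns all results
`order`	String	`asc`, `desc`	/	Yes	Sorts the results by the `rank` score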
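The interaction between `loop_num` and `max_change` can be read as a simple stopping test applied after each round. The helper below is a hedged sketch of that reading, not the engine's code: the name `has_converged` and the dictionary-based rank representation are assumptions.

```python
def has_converged(prev_rank, new_rank, max_change):
    """Return True when every node's rank moved by less than max_change.

    A max_change of 0 disables the test, so only loop_num ends the run.
    """
    if max_change == 0:
        return False
    return all(abs(new_rank[n] - prev_rank[n]) < max_change for n in new_rank)

# Hypothetical ranks from two consecutive rounds.
previous = {"A": 0.2000, "B": 0.3600}
current = {"A": 0.2005, "B": 0.3601}

print(has_converged(previous, current, max_change=0.001))
print(has_converged(previous, current, max_change=0))
```

Under this reading, a run stops at whichever comes first: the round in which the test passes, or round `loop_num`.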

File Writeback

CALL algo.text_rank.write("hdc_textrank", {
  params: {
    return_id_uuid: "id",
    init_value: 1,
    loop_num: 50,
    damping: 0.8,
    edge_schema_property: "weight",
    order: 'desc'
  },
  return_params: {
    file: {
      filename: "textrank"
    }
  }
})

algo(text_rank).params({
  projection: "hdc_textrank",
  return_id_uuid: "id",
  init_value: 1,
  loop_num: 50,
  damping: 0.8,
  edge_schema_property: "weight",
  order: 'desc'
}).write({
  file: {
    filename: "textrank"
  }
})

Result:

_id,text_rank
G,0.973568
E,0.81696
A,0.3472
D,0.328
B,0.24
F,0.2
C,0.2
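These scores can be cross-checked outside the database. The sketch below assumes the weighted update from the original paper (each node gets (1 - d) plus d times the weight-proportional share of each predecessor's rank, following edge direction) and applies it to the example graph with the same `damping`, `init_value`, and `loop_num` as the call above. The function `text_rank` is an illustrative stand-in, not the engine's implementation.

```python
def text_rank(nodes, edges, damping, init_value, loop_num):
    """Weighted rank update over directed (source, target, weight) edges."""
    out_weight = {n: 0.0 for n in nodes}
    for src, _dst, w in edges:
        out_weight[src] += w

    rank = {n: init_value for n in nodes}
    for _ in range(loop_num):
        new_rank = {n: 1.0 - damping for n in nodes}
        for src, dst, w in edges:
            # A predecessor passes on its rank in proportion to the edge weight.
            new_rank[dst] += damping * (w / out_weight[src]) * rank[src]
        rank = new_rank
    return rank

# The example graph created earlier on this page.
nodes = list("ABCDEFG")
edges = [
    ("A", "E", 3), ("B", "A", 3), ("B", "E", 2),
    ("C", "A", 1), ("C", "D", 4), ("D", "E", 5),
    ("E", "G", 2), ("F", "B", 1), ("F", "G", 3),
]

ranks = text_rank(nodes, edges, damping=0.8, init_value=1.0, loop_num=50)
ordered = sorted(ranks.items(), key=lambda item: -item[1])
for node, score in ordered:
    print(f"{node},{score:.6g}")
```

For example, B has a single predecessor F, which sends 1 of its total outgoing weight 4 to B, so B's score is 0.2 + 0.8 × (1/4) × 0.2 = 0.24, matching the listing above. C and F have no incoming edges and stay at (1 - d) = 0.2.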

DB Writeback

Writes the text_rank values from the results to the specified node property. The property type is double.

CALL algo.text_rank.write("hdc_textrank", {
  params: {
    loop_num: 50,
    edge_schema_property: "@default.weight"
  },
  return_params: {
    db: {
      property: "rank"
    }
  }
})

algo(text_rank).params({
  projection: "hdc_textrank",
  loop_num: 50,
  edge_schema_property: "@default.weight"
}).write({
  db:{ 
    property: 'rank'
  }
})

Full Return

CALL algo.text_rank("hdc_textrank", {
  params: {
    return_id_uuid: "id",    
    init_value: 1,
    loop_num: 50,
    damping: 0.8,
    edge_schema_property: "weight",
    order: 'desc',
    limit: 5
  },
  return_params: {}
}) YIELD TR
RETURN TR

exec{
  algo(text_rank).params({
    return_id_uuid: "id",    
    init_value: 1,
    loop_num: 50,
    damping: 0.8,
    edge_schema_property: "weight",
    order: 'desc',
    limit: 5
  }) as TR
  return TR
} on hdc_textrank

Result:

_id	text_rank
G	0.973568
E	0.81696
A	0.3472
D	0.328
B	0.24

Stream Return

CALL algo.text_rank("hdc_textrank", {
  params: {
    return_id_uuid: "id",
    loop_num: 50,
    damping: 0.8,
    edge_schema_property: "weight",
    order: 'desc',
    limit: 5
  },
  return_params: {
    stream: {}
  }
}) YIELD TR
RETURN TR

exec{
  algo(text_rank).params({
    return_id_uuid: "id",
    loop_num: 50,
    damping: 0.8,
    edge_schema_property: "weight",
    order: 'desc',
    limit: 5
  }).stream() as TR
  return TR
} on hdc_textrank

Result:

_id	text_rank
G	0.973568
E	0.81696
A	0.3472
D	0.328
B	0.24
