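Preface

For convenience, this post only collects a few ways of extracting text summaries that I found online; it does not walk through the code.
I have tested all of these methods successfully, but the results they extract differ, so I recommend comparing their outputs before settling on one.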

Java: using Classifier4J

Supports English text only; Chinese is not supported.
This method requires classifier4J.jar on the classpath.

import net.sf.classifier4J.summariser.ISummariser;
import net.sf.classifier4J.summariser.SimpleSummariser;

public class Classifier4J {
    public static void main(String[] args) {
        // Text to summarise (English only)
        String str = "Here is the content of the article";
        ISummariser s = new SimpleSummariser();
        // Second argument: number of sentences to keep in the summary
        String result = s.summarise(str, 1);
        System.out.println(result);
    }
}

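Java: using HanLP (tested)

Supports both Chinese and English text.
Official website: HanLP official site
Official GitHub documentation: HanLP GitHub
Official GitHub documentation for the Java version: HanLP-Java GitHub
HanLP covers Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, and general natural language processing.

Add the Maven dependency: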

<!-- HanLP dependency; releases: https://github.com/hankcs/HanLP/releases -->
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.4</version>
</dependency>

import com.hankcs.hanlp.HanLP;
import java.util.List;

public class HanlpMain {

    public static void main(String[] args) {
        // content1 is a Chinese sample; content2 and content3 are English samples
        // that can be passed to the calls below instead
        String content1 = "2019年12月2日,为了见到梦寐以求的复兴号G1,复兴号听说很漂亮,我必须要见到复兴号,于是花了10块钱买了一张火车票,我带着行李去了火车站等火车";
        String content2 = "Bitcoin is yet to establish itself as the safe haven asset many see it as eventually becoming, falling sharply this week even as global stocks continue to slide—but in some places, people seem to be turning to bitcoin in times of crisis. The bitcoin price, which has dropped over the last week by almost 10%, has been trading at a premium in Hong Kong, where massive anti-government protests have crippled the city's major international airports for two straight days. Bitcoin was recently trading at a 4% premium in Hong Kong on the LocalBitcoins peer-to-peer bitcoin exchange which matches buyers to sellers, it was first reported by Bloomberg, a financial newswire. Hong Kong airport has now resumed operations five days after protesters first arrived with violent clashes erupting late last night and two men believed to be Chinese agents assaulted. Bitcoin advocates have been keen to encourage bitcoin's positive response to the unrest in Hong Kong. \"We learn Hong Kong will be put under martial law. Bitcoin goes up in price,\" bitcoin and crypto investor and co-founder of Morgan Creek Digital Assets Anthony Pompliano said earlier this month via Twitter. \"The insurance policy against global chaos and instability is working.\" Meanwhile, in Argentina, where the country's incumbent president Mauricio Macri lost by a far greater margin than expected in primary elections last weekend sending shockwaves through the region's financial markets, bitcoin has also been trading at a premium. Some have played down bitcoin's use in the country however, suggesting the U.S. dollar remains the alternative currency of choice. \"Before bitcoiners start using Argentina as excuse to yell 'buy bitcoin' ... Argentines want to protect themselves against the peso losing value versus the dollar. And for that, they buy dollars,\" Alex Kruger, an Argentina-born crypto trader, said via Twitter.";
        String content3 = "The U.S.-China trade war and the recent violent Hong Kong protests are giving people a case of the jitters. They've pulled a staggering $64.7 billion out of China in the three months through July, according to new research from the Washingon D.C.-based think tank, The Institute of International Finance. No other emerging market country saw greater net outflows over the three months, the data show. The report titled \"Trade Tantrum Revisited\" reviewed net capital flows across emerging markets (EM) broadly. It stated: China is the second-largest economy in the world and the largest EM. The report shows that the recent outflow from China was more than double the $29.6 billion total that left the country for the whole of 2018. The figures represent the net flows and don't just include investments in debt and equity securities. In July, which is the latest data available, investors took out $9.4 billion, which came on the back of $33.8 billion and $21.5 billion in June and May respectively. These moves come after fast-deteriorating trade talks between Beijing and Washington led to increases in tariffs so sparking worries of a global economic slowdown. Also, the past few weeks has seen increasingly violent clashes between pro-democracy protesters and security forces in Hong Kong. At the same time, Beijing's heavy-handed approach to the crisis hasn't helped soothe anyone, drawing the ire of the international community. The violence and political uncertainty led e-commerce company Alibaba postponing a planned Hong Kong secondary listing of its stock earlier this month. What happened to China in August in terms of overall capital flows isn't clear yet. But preliminary data show China saw equity inflows of $1.5 billion during the month.";
        // Extract keywords
        System.out.println(getKeyWord(content1, 5));
        // Extract the summary, first as a list of sentences, then as one string
        System.out.println(getAbstractToList(content1, 5).toString());
        System.out.println(getAbstractToString(content1, 50));
    }

    /**
     * Extract keywords.
     * @param content text to analyse
     * @param size number of keywords to return (defaults to 5)
     */
    public static String getKeyWord(String content, Integer size) {
        if (size == null || size == 0) {
            size = 5;
        }
        List<String> result = HanLP.extractKeyword(content, size);
        return result.toString();
    }

    /**
     * Automatic text summary as a list of key sentences.
     * @param content text to analyse
     * @param size number of sentences to return (defaults to 1)
     * @return List<String>
     */
    public static List<String> getAbstractToList(String content, Integer size) {
        if (size == null || size == 0) {
            size = 1;
        }
        return HanLP.extractSummary(content, size);
    }

    /**
     * Automatic text summary as a single string.
     * @param content text to analyse
     * @param maxLength maximum length of the summary; as I understand the
     *                  HanLP 1.x API, getSummary takes a length limit here,
     *                  not a sentence count (defaults to 50)
     * @return String
     */
    public static String getAbstractToString(String content, Integer maxLength) {
        if (maxLength == null || maxLength == 0) {
            maxLength = 50;
        }
        return HanLP.getSummary(content, maxLength);
    }
}
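
For readers who want a feel for what a graph-based extractive summarizer does, here is a minimal Python sketch. HanLP 1.x documents its summary extraction as TextRank-based, but the code below is not HanLP's implementation: the helper names (`split_sentences`, `tokenize`, `textrank_summary`), the regex tokenizer (single CJK characters, whole Latin words) and the Jaccard similarity are all assumptions made for this illustration.

```python
import re


def split_sentences(text):
    """Naive splitter on Chinese and English sentence terminators (assumption)."""
    parts = re.findall(r"[^。!?.!?]+[。!?.!?]*", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Latin words and digits as whole tokens, CJK characters one by one."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", sentence.lower())


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def textrank_summary(text, size=1, damping=0.85, iterations=50):
    """Return the `size` highest-ranked sentences, in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    token_sets = [set(tokenize(s)) for s in sentences]

    # weights[i][j]: similarity between sentences i and j (0 on the diagonal)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = jaccard(token_sets[i], token_sets[j])
            weights[i][j] = weights[j][i] = w
    out_sum = [sum(row) for row in weights]

    # PageRank-style iteration over the weighted sentence graph
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:size]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = "Cats like fish. Cats like milk. Dogs chase cars. Cats like fish and milk."
    print(textrank_summary(sample, 1))
```

A sentence ranks highly when it shares vocabulary with many other sentences, which is why results from this kind of method can differ from a pure word-frequency approach on the same text.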

Python: using NLTK

You need to install nltk and download its data packages.

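# Install the nltk package
pip install nltk
# Open a Python console from the command line (type python in cmd)
# Download the punkt and stopwords data
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('stopwords')
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt';
# if tokenizing fails with an error naming it, run nltk.download('punkt_tab')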

Summarization script (reads the article from str.txt):

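from collections import defaultdict
from heapq import nlargest
from string import punctuation

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# English stop words plus punctuation are ignored when counting
stop_words = set(stopwords.words('english') + list(punctuation))
# Words whose normalized frequency is >= max_cut or <= min_cut are dropped
max_cut = 0.9
min_cut = 0.1


def compute_frequencies(word_sent):
    """
    Compute how often each word occurs.
    word_sent is a list of already tokenized sentences.
    Returns a dict freq where freq[w] is the normalized frequency of w.
    """
    # Unlike a plain dict, defaultdict supplies a default value;
    # with int as the factory that default is 0
    freq = defaultdict(int)
    # Count the occurrences of each word
    for s in word_sent:
        for word in s:
            # Skip stop words
            if word not in stop_words:
                freq[word] += 1

    # Nothing to normalize if every token was a stop word
    if not freq:
        return freq
    # Highest raw count m
    m = float(max(freq.values()))
    # Divide every count by m, then drop words that are too common or too rare
    for w in list(freq.keys()):
        freq[w] = freq[w] / m
        if freq[w] >= max_cut or freq[w] <= min_cut:
            del freq[w]
    # Result: {key: word, value: importance}
    return freq


def summarize(text, n):
    """
    Main summarization function.
    text is the input text.
    n is the number of sentences in the summary.
    Returns a list containing the summary sentences.
    """
    # First split the text into sentences
    sents = sent_tokenize(text)
    assert n <= len(sents)
    # Then tokenize each sentence into words
    word_sent = [word_tokenize(s.lower()) for s in sents]
    # freq maps each word to its importance
    freq = compute_frequencies(word_sent)
    # ranking maps each sentence index to the sentence's importance
    ranking = defaultdict(int)
    for i, words in enumerate(word_sent):
        for w in words:
            if w in freq:
                ranking[i] += freq[w]
    sents_idx = rank(ranking, n)
    return [sents[j] for j in sents_idx]


def rank(ranking, n):
    """
    With many sentences, scanning for the n largest values by hand is slow,
    so heapq.nlargest is used instead.
    Returns the indices of the n highest-scoring sentences.
    """
    return nlargest(n, ranking, key=ranking.get)


# str.txt holds the article to summarize
if __name__ == '__main__':
    with open("str.txt", "r", encoding="utf-8") as myfile:
        text = myfile.read()
    # Join lines with a space so words at line breaks do not run together
    text = text.replace('\n', ' ')
    res = summarize(text, 1)
    for sentence in res:
        print(sentence)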
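
Since the preface recommends comparing the results of the different methods, a small helper can make that comparison less subjective. The sketch below is one possible way to do it, not part of any of the libraries above: it measures token overlap (Jaccard similarity) between every pair of summaries. The function names and the sample summary strings are hypothetical, and the tokenizer (whole Latin words, single CJK characters) is a simplifying assumption.

```python
import re


def tokens(text):
    """Set of tokens: Latin words and digits whole, CJK characters one by one."""
    return set(re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower()))


def jaccard_overlap(summary_a, summary_b):
    """Token-level Jaccard similarity between two summaries, in [0, 1]."""
    a, b = tokens(summary_a), tokens(summary_b)
    if not a and not b:
        return 1.0  # two empty summaries are treated as identical
    return len(a & b) / len(a | b)


def compare_summaries(summaries):
    """
    summaries: dict mapping a method name to the summary it produced.
    Returns a dict mapping (name_a, name_b) pairs to their overlap score.
    """
    names = sorted(summaries)
    result = {}
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            result[(name_a, name_b)] = jaccard_overlap(
                summaries[name_a], summaries[name_b]
            )
    return result


if __name__ == "__main__":
    # Placeholder outputs; paste the real summaries from each method here
    outputs = {
        "classifier4j": "Bitcoin has been trading at a premium in Hong Kong.",
        "hanlp": "Bitcoin was recently trading at a 4% premium in Hong Kong.",
        "nltk": "Argentines want to protect themselves against the peso.",
    }
    for pair, score in compare_summaries(outputs).items():
        print(pair, round(score, 2))
```

A score near 1 means two methods picked essentially the same content; a score near 0 means they disagree, which is a signal to read both summaries against the source text before choosing a method.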