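How to Extract Article Summaries with Natural Language Processing (NLP)
Preface
For convenience, this post only collects several summary-extraction methods found online and shows how to use them; it does not walk through the code.
I have tested all of these methods successfully, but the summaries they produce differ, so I recommend comparing their results before choosing one.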
Java: using Classifier4J
Supports English text; does not support Chinese.
This method requires adding classifier4J.jar to the project.
import net.sf.classifier4J.summariser.ISummariser;
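Java: using HanLP (tested)
Supports both Chinese and English text.
Official website: HanLP website
Official GitHub documentation: HanLP GitHub
Official GitHub documentation for the Java version: HanLP-Java GitHub
HanLP covers: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, and natural language processing in general.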
Add the Maven dependency:
<!-- HanLP dependency package: https://github.com/hankcs/HanLP/releases -->
<dependency>
<groupId>com.hankcs</groupId>
<artifactId>hanlp</artifactId>
<version>portable-1.8.4</version>
</dependency>

import com.hankcs.hanlp.HanLP;
import java.util.List;
public class HanlpMain {
    public static void main(String[] args) {
        String content1 = "2019年12月2日,为了见到梦寐以求的复兴号G1,复兴号听说很漂亮,我必须要见到复兴号,于是花了10块钱买了一张火车票,我带着行李去了火车站等火车";
String content2 = "Bitcoin is yet to establish itself as the safe haven asset many see it as eventually becoming, falling sharply this week even as global stocks continue to slide—but in some places, people seem to be turning to bitcoin in times of crisis. The bitcoin price, which has dropped over the last week by almost 10%, has been trading at a premium in Hong Kong, where massive anti-government protests have crippled the city's major international airports for two straight days. Bitcoin was recently trading at a 4% premium in Hong Kong on the LocalBitcoins peer-to-peer bitcoin exchange which matches buyers to sellers, it was first reported by Bloomberg, a financial newswire. Hong Kong airport has now resumed operations five days after protesters first arrived with violent clashes erupting late last night and two men believed to be Chinese agents assaulted. Bitcoin advocates have been keen to encourage bitcoin's positive response to the unrest in Hong Kong. \"We learn Hong Kong will be put under martial law. Bitcoin goes up in price,\" bitcoin and crypto investor and co-founder of Morgan Creek Digital Assets Anthony Pompliano said earlier this month via Twitter. \"The insurance policy against global chaos and instability is working.\" Meanwhile, in Argentina, where the country's incumbent president Mauricio Macri lost by a far greater margin than expected in primary elections last weekend sending shockwaves through the region's financial markets, bitcoin has also been trading at a premium. Some have played down bitcoin's use in the country however, suggesting the U.S. dollar remains the alternative currency of choice. \"Before bitcoiners start using Argentina as excuse to yell 'buy bitcoin' ... Argentines want to protect themselves against the peso losing value versus the dollar. And for that, they buy dollars,\" Alex Kruger, an Argentina-born crypto trader, said via Twitter.";
String content3 = "The U.S.-China trade war and the recent violent Hong Kong protests are giving people a case of the jitters. They've pulled a staggering $64.7 billion out of China in the three months through July, according to new research from the Washingon D.C.-based think tank, The Institute of International Finance. No other emerging market country saw greater net outflows over the three months, the data show. The report titled \"Trade Tantrum Revisited\" reviewed net capital flows across emerging markets (EM) broadly. It stated: China is the second-largest economy in the world and the largest EM. The report shows that the recent outflow from China was more than double the $29.6 billion total that left the country for the whole of 2018. The figures represent the net flows and don't just include investments in debt and equity securities. In July, which is the latest data available, investors took out $9.4 billion, which came on the back of $33.8 billion and $21.5 billion in June and May respectively. These moves come after fast-deteriorating trade talks between Beijing and Washington led to increases in tariffs so sparking worries of a global economic slowdown. Also, the past few weeks has seen increasingly violent clashes between pro-democracy protesters and security forces in Hong Kong. At the same time, Beijing's heavy-handed approach to the crisis hasn't helped soothe anyone, drawing the ire of the international community. The violence and political uncertainty led e-commerce company Alibaba postponing a planned Hong Kong secondary listing of its stock earlier this month. What happened to China in August in terms of overall capital flows isn't clear yet. But preliminary data show China saw equity inflows of $1.5 billion during the month.";
        // Extract keywords
        System.out.println(getKeyWord(content1, 5));
        // Extract the article summary, first as a list of sentences, then as a single string
        System.out.println(getAbstractToList(content1, 5).toString());
        System.out.println(getAbstractToString(content1, 5));
    }
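
    /**
     * Extract keywords.
     * @param content text to analyze
     * @param index number of keywords to return; defaults to 5 when null or 0
     * @return the keywords, formatted as a list string
     */
    public static String getKeyWord(String content, Integer index) {
        if (index == null || index == 0) {
            index = 5;
        }
        List<String> result = HanLP.extractKeyword(content, index);
        return result.toString();
    }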

    /**
     * Automatic text summarization, returned as a list of key sentences.
     * @param content text to summarize
     * @param index number of key sentences to return; defaults to 1 when null or 0
     * @return List<String>
     */
    public static List<String> getAbstractToList(String content, Integer index) {
        if (index == null || index == 0) {
            index = 1;
        }
        List<String> result = HanLP.extractSummary(content, index);
        return result;
    }
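
    /**
     * Automatic text summarization, returned as a single string.
     * @param content text to summarize
     * @param index maximum length of the summary (passed to HanLP.getSummary as its
     *              length limit, not a sentence count); defaults to 1 when null or 0
     * @return String
     */
    public static String getAbstractToString(String content, Integer index) {
        if (index == null || index == 0) {
            index = 1;
        }
        String result = HanLP.getSummary(content, index);
        return result;
    }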
}
Python: using NLTK
Requires installing nltk and its add-on packages.
# Install the nltk dependencies
from nltk.tokenize import sent_tokenize, word_tokenize
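The import above pulls in a sentence tokenizer and a word tokenizer, which are the two building blocks of a simple frequency-based extractive summary. The sketch below is one common way to combine them, not the code from the original post. So that it runs with the standard library alone, it uses naive regex helpers (`split_sentences`, `split_words`) as stand-ins for `sent_tokenize` and `word_tokenize`. The `summarize` function and the tiny `STOPWORDS` set are hypothetical names introduced for illustration, and the approach only suits space-delimited languages such as English.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK ships a fuller one).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "to", "in",
             "on", "and", "or", "it", "this", "that", "for", "with", "as"}


def split_sentences(text):
    """Naive stand-in for nltk's sent_tokenize: split after ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Naive stand-in for nltk's word_tokenize: lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Count how often each non-stopword occurs in the whole text.
    freq = Counter(w for s in sentences for w in split_words(s)
                   if w not in STOPWORDS)
    # A sentence's score is the sum of the frequencies of its non-stopwords.
    scores = [sum(freq[w] for w in split_words(s) if w not in STOPWORDS)
              for s in sentences]
    # Pick the top-scoring sentence indices, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    sample = ("Bitcoin fell sharply this week. "
              "Bitcoin traded at a premium in Hong Kong. "
              "The weather was mild.")
    print(summarize(sample, max_sentences=2))
```

Summing raw frequencies favors longer sentences, so a real pipeline would likely normalize scores by sentence length and swap the regex helpers for the NLTK tokenizers once the required data packages are installed.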