Lucene Basics: An Introductory Tutorial


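1.1 The concept of full-text search

1) Find the information you need, quickly and accurately, within a large body of information.

2) The content being searched is text, not multimedia.

3) How the search works: the query is not processed according to the meaning of the sentence. If the search text is "Was Zhao Benshan in the 2012 Spring Festival Gala?", then anything containing those terms (2012, Spring Festival Gala, Zhao Benshan) is returned. Every term is a keyword.

4) Completeness, speed, and accuracy are the key measures of a full-text search system.

5) In summary:

a) Only text is processed

b) Semantics are not processed

c) Searches over English text are case-insensitive

d) The result list is ranked by relevance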
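
The points above can be illustrated with a minimal sketch in plain Java. It is not Lucene code: the class KeywordMatcher and its methods are hypothetical names chosen for this illustration, and it assumes text whose words are separated by spaces or punctuation (Chinese text would need a real analyzer). It only shows that matching works on keywords, ignores case, and never looks at meaning.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class KeywordMatcher {

    // Split the text on anything that is not a letter or digit and lower-case each term.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    // A document matches when it contains at least one query term.
    // The meaning of the query sentence is never considered.
    static boolean matches(String document, String query) {
        Set<String> documentTerms = terms(document);
        for (String term : terms(query)) {
            if (documentTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String document = "Lucene is a search library";
        System.out.println(matches(document, "LUCENE tutorial")); // true: "lucene" matches, case ignored
        System.out.println(matches(document, "database"));        // false: no term in common
    }
}
```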

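Application scenarios for full-text search:

* The volume of information must be very large

* The measures of a full-text search system are:

Speed

Accuracy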

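Site search

Site search is typically used in systems that hold large amounts of data, to find the material you want. Common examples:

a) Keyword search on a BBS

For example, searching Baidu Tieba for Lin Chi-ling (林志玲) or Hu Hansan (胡汉三)

b) Search on a shopping site

For example, searching ZOL (Zhongguancun Online) by product name or computer hardware name (CPU)

c) File management systems

Searching for files, such as the Windows file search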

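1.3.2 Vertical search

a) A search engine aimed at one specific industry

b) A subdivision and extension of general search engines

c) An integration of specialized information from the web-page store

d) Characterized by being specialized, deep, and precise, with an industry flavor

e) Applicable to shopping search, real-estate search, and job search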

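1.4 Differences between full-text search and database search

1.4.1 Database search

Typical form: select * from table_name where column_name like '%keyword%'

Example: select * from article where content like '%here%'

Result: rows containing where, here, or shere all match

Drawbacks:

1) Search quality is poor: matching is by substring, so words that merely contain the keyword are returned as well.

2) A large number of rows come back, and many of them are useless.

3) With a large amount of data, it is hard to make the query fast.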
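
The first drawback can be reproduced without a database. The sketch below is plain Java, with the hypothetical class name LikeVersusTerm; it imitates a like '%here%' filter using String.contains and compares it with matching on whole words. The row values are taken from the example above, plus one unrelated word added for contrast.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; it imitates the two matching styles in memory.
class LikeVersusTerm {

    // Imitates "content like '%keyword%'": any row containing the keyword as a substring matches.
    static List<String> likeSearch(List<String> rows, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            if (row.contains(keyword)) {
                hits.add(row);
            }
        }
        return hits;
    }

    // Imitates term matching: a row matches only if one of its words equals the keyword.
    static List<String> termSearch(List<String> rows, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            for (String word : row.split("\\s+")) {
                if (word.equals(keyword)) {
                    hits.add(row);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("where", "here", "shere", "hello");
        System.out.println(likeSearch(rows, "here")); // [where, here, shere]
        System.out.println(termSearch(rows, "here")); // [here]
    }
}
```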

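1.4.2 Full-text search

1) Results are ranked by relevance. This means the first few pages are the ones most useful to the user, while the remaining results are likely to be far from what the user wants. The LIKE-based database search cannot rank by relevance.

2) Full-text search looks terms up in an index, so it is usually much faster than a database like '%keyword%' query, which generally has to scan the text of every row.

3) A database LIKE search is therefore not a substitute for full-text search.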
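
To make the index idea concrete, here is a minimal inverted-index sketch in plain Java. It is an illustration under simplifying assumptions, not how Lucene is implemented: the class TinyInvertedIndex is a hypothetical name, the index lives in memory, and "relevance" is reduced to how often the term occurs in a document, whereas Lucene's real scoring takes more factors into account.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, in-memory illustration of an inverted index; not Lucene's implementation.
class TinyInvertedIndex {

    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Tokenize the text and record, for each term, which document it occurs in and how often.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Look the term up directly instead of scanning every document.
    // Documents with more occurrences come first; ties are ordered by document id.
    List<Integer> search(String term) {
        Map<Integer, Integer> hits = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer, Integer>emptyMap());
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byCount = Integer.compare(hits.get(b), hits.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return ids;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "lucene is a search library");
        index.add(2, "search engines search an index and search fast");
        index.add(3, "a database table");
        System.out.println(index.search("search"));  // [2, 1]
        System.out.println(index.search("missing")); // []
    }
}
```

The lookup cost depends on the number of documents that contain the term, not on the total amount of text, which is the reason given above for the speed difference.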


Full-text search is only a concept; many frameworks implement it, and Lucene is one of them. The Lucene home page is http://lucene.apache.org/. This article uses version 3.0.1.

Structure diagram of Internet search


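Notes:

1) When a user opens www.baidu.com and searches for something, the search does not go directly to the web pages; it goes to Baidu's index store. The index store holds index numbers and summaries. What we see in the results on www.baidu.com is the summary content.

2) An index entry in Baidu's index store corresponds to a website on the Internet.

3) When the user enters the keyword to search for, the page that comes back is first built from the index store.

4) Only when the user clicks a search result is the actual web page on the Internet looked up.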

2.2 Rough structure diagram of Lucene


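Notes:

1) In a database, the data files are stored on disk. An index store is the same: its index data also lives on disk, and the Directory class is used to describe that location.

2) Through the API we can add, delete, update, and query entries in the index store.

3) In a database, all forms of data can be reduced to one form: the table. In an index store, all forms of data can likewise be abstracted into one data format: the Document.

4) The structure of a Document is: Document(List<Field>)

5) A Field holds one key-value pair, and both the key and the value are strings.

6) Operating on the index in the index store therefore means operating on Documents.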
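
The Document(List<Field>) shape described in the notes can be sketched in plain Java as follows. SimpleField and SimpleDocument are hypothetical stand-ins written for this illustration; they are not Lucene's own Field and Document classes, which carry additional options such as whether a value is stored or analyzed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a field: one key-value pair, both strings.
class SimpleField {
    final String name;
    final String value;

    SimpleField(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical stand-in for a document: an ordered list of fields.
class SimpleDocument {
    private final List<SimpleField> fields = new ArrayList<>();

    void add(SimpleField field) {
        fields.add(field);
    }

    // Returns the value of the first field with the given name, or null if there is none.
    String get(String name) {
        for (SimpleField field : fields) {
            if (field.name.equals(name)) {
                return field.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleDocument document = new SimpleDocument();
        document.add(new SimpleField("id", "1"));
        document.add(new SimpleField("title", "lucene can be used to build a search engine"));
        System.out.println(document.get("id"));      // 1
        System.out.println(document.get("content")); // null
    }
}
```

Because every value is a string, a numeric id has to be converted to a string when it is stored and parsed back when it is read.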


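1.1 Preparing the Lucene development environment

To set up a Lucene development environment you need the Lucene jar files. At a minimum, add:

1) lucene-core-3.1.0.jar (core library)

2) lucene-analyzers-3.1.0.jar (analyzers)

3) lucene-highlighter-3.1.0.jar (highlighter)

4) lucene-memory-3.1.0.jar (in-memory index, used by the highlighter)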

// Simple bean holding the data that is written to and read from the index
public class Article {
    private Long id;
    private String title;
    private String content;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}

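// Imports for the test class that contains both test methods (Lucene 3.x package names)
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

@Test
public void testCreateIndex() throws Exception {
    /**
     * 1. Create an Article object and fill in its data
     * 2. Use the IndexWriter API to store the data in the index store
     * 3. Close the IndexWriter
     */
    // Create an Article object and fill in its data
    Article article = new Article();
    article.setId(1L);
    article.setTitle("lucene 可以做搜索引擎 ");
    article.setContent("baidu,google 都是很好的搜索引擎 ");
    // Use the IndexWriter API to store the data in the index store
    /**
     * Create an IndexWriter. It takes three arguments:
     * 1. the Directory, which points to the location of the index store
     * 2. the analyzer
     * 3. the maximum field length
     */
    // Open the index store directory
    Directory directory = FSDirectory.open(new File("./indexDir"));
    // Create the analyzer
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    IndexWriter indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.LIMITED);
    // Convert the Article object into a Document
    Document document = new Document();
    // The id is stored as-is and not analyzed; title and content are stored and analyzed
    Field idField = new Field("id", article.getId().toString(), Store.YES, Index.NOT_ANALYZED);
    Field titleField = new Field("title", article.getTitle(), Store.YES, Index.ANALYZED);
    Field contentField = new Field("content", article.getContent(), Store.YES, Index.ANALYZED);
    document.add(idField);
    document.add(titleField);
    document.add(contentField);
    indexWriter.addDocument(document);
    // Close the IndexWriter
    indexWriter.close();
}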

@Test
public void testSearchIndex() throws Exception {
    /**
     * 1. Create an IndexSearcher object
     * 2. Call the search method to run the query
     * 3. Print the results
     */
    // Create an IndexSearcher object
    Directory directory = FSDirectory.open(new File("./indexDir"));
    IndexSearcher indexSearcher = new IndexSearcher(directory);
    // Call the search method to run the query
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    QueryParser queryParser = new QueryParser(Version.LUCENE_30, "content", analyzer);
    Query query = queryParser.parse("baidu"); // the keyword
    TopDocs topDocs = indexSearcher.search(query, 2); // return at most 2 hits
    int count = topDocs.totalHits; // total number of records matching the keyword
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    List<Article> articleList = new ArrayList<Article>();
    for (ScoreDoc scoreDoc : scoreDocs) {
        float score = scoreDoc.score; // relevance score of this hit
        int index = scoreDoc.doc; // document number inside the index
        Document document = indexSearcher.doc(index);
        // Convert the Document back into an Article
        Article article = new Article();
        // document.get("id") is equivalent to document.getField("id").stringValue()
        article.setId(Long.parseLong(document.get("id")));
        article.setTitle(document.get("title"));
        article.setContent(document.get("content"));
        articleList.add(article);
    }
    // Release the searcher once all documents have been read
    indexSearcher.close();

    for (Article article : articleList) {
        System.out.println(article.getId());
        System.out.println(article.getTitle());
        System.out.println(article.getContent());
    }
}

How a piece of information is written to the index store


How information is read back