Full-text search with PHP + SCWS Chinese word segmentation + Sphinx + MySQL

Installing Sphinx

See https://jerryblog.cn/b_d@MNjmMY6f918e5@NNzmkY6f918e5.html

Installing libsphinxclient

cd coreseek-4.1-beta/csft-4.1/api/libsphinxclient
sh buildconf.sh
./configure --prefix=/usr/local/sphinxclient
make && make install

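Installing the PHP extension

## wget did not work for me, so I downloaded the archive on Windows and uploaded it to the server.
## This snapshot is the PHP 7 version of the extension; for other PHP versions, search for the matching package.
## Quote the URL: it contains ";" and "?", which the shell would otherwise interpret.
wget -O sphinx-9a3d08c.tar.gz "http://git.php.net/?p=pecl/search_engine/sphinx.git;a=snapshot;h=9a3d08c67af0cad216aa0d38d39be71362667738;sf=tgz"
tar zxvf sphinx-9a3d08c.tar.gz
cd sphinx-9a3d08c
# (use the directory of the PHP installation you are actually running)
/www/wdlinux/apache_php/bin/phpize
./configure --with-php-config=/www/wdlinux/apache_php/bin/php-config --with-sphinx=/usr/local/sphinxclient
make && make install
After installation, the sphinx.so extension is generated under /www/wdlinux/apache_php-7.0.6/lib/php/extensions/no-debug-non-zts-20151012/. The output directory depends on where your PHP is installed.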

Add the following to php.ini:

[sphinx]
extension_dir=/www/wdlinux/apache_php-7.0.6/lib/php/extensions/no-debug-non-zts-20151012/
extension=sphinx.so
## Restart the web server afterwards

Check the output of phpinfo();

[Screenshot of the phpinfo() output: https://imgs.jerryblog.cn/2018051815266367056257.png]
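
As a command-line alternative to the phpinfo() page, the loaded modules can also be listed with `php -m`. The sketch below is only an illustration: the helper name `ext_loaded` is made up, and the binary path assumes the same PHP tree that was used for phpize above. The CLI may read a different php.ini than the web server, so a miss here is not conclusive.

```shell
#!/bin/bash
# ext_loaded is a hypothetical helper, not part of PHP or Sphinx.
# $1 = path to the php binary, $2 = extension name to look for.
ext_loaded() {
    # "php -m" prints one loaded module per line;
    # match the whole line, case-insensitively
    "$1" -m 2>/dev/null | grep -qix -- "$2"
}

# Assumed path: the PHP install used for phpize above; adjust to your own.
php_bin=/www/wdlinux/apache_php/bin/php

if ext_loaded "$php_bin" sphinx; then
    echo "sphinx extension is loaded"
else
    echo "sphinx extension not found in: $php_bin -m"
fi
```

The same call with `scws` as the second argument checks the word segmentation extension.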

Installation is now complete.

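The rest of this post uses a "Main + Delta" (main index + delta index) indexing strategy, with Sphinx's built-in unigram (single-character) tokenization.

##Test data

First, a few concepts:

source: the data source, i.e. where the data comes from.
index: the index, built from a data source. An index works like the lookup section of a dictionary: it can only exist once the dictionary's full contents are available.
searchd: provides the search query service. It normally runs in the background as a daemon.
indexer: the tool that builds indexes. Whenever an index needs to be rebuilt, you run the indexer command.
attr: an attribute. Attributes are stored in the index; they are not full-text indexed, but they can be used for filtering and sorting.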

The configuration file can be viewed at: https://jerryblog.cn/b_d@MNjmMY6f918e5@ONDmcY6f918e5.html
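
For readers who cannot open the linked configuration, here is a rough sketch of how the concepts above could map onto a Main + Delta sphinx.conf. It is not the author's file. The `items` table, its columns, the credentials, and the paths under /usr/local/sphinx/var are placeholders; the index names (items, items_delta, master), the sphinx_items database, and the sph_counter table are taken from the scripts in this post. Tokenizer settings are left out on purpose. The script only writes the sketch to a scratch file for review.

```shell
#!/bin/bash
# Sketch only: writes an illustrative Main + Delta config to a scratch file.
# Assumes a counter table such as:
#   CREATE TABLE sph_counter (counter_id INTEGER PRIMARY KEY NOT NULL,
#                             max_doc_id INTEGER NOT NULL);
conf="${TMPDIR:-/tmp}/sphinx_main_delta_sketch.conf"

cat > "$conf" <<'EOF'
source items
{
    type             = mysql
    sql_host         = 127.0.0.1
    sql_user         = sphinx_user
    sql_pass         = change_me
    sql_db           = sphinx_items
    sql_query_pre    = SET NAMES utf8
    # counter 1: highest id covered by the main index
    sql_query_pre    = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM items
    sql_query        = SELECT id, c_id, title, content FROM items \
                       WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
    sql_attr_uint    = c_id
    # indexed as full text and also stored as an attribute
    sql_field_string = title
}

source items_delta : items
{
    # a child's sql_query_pre replaces the inherited ones, so repeat SET NAMES
    sql_query_pre = SET NAMES utf8
    # counter 2: highest id seen by this delta run
    sql_query_pre = REPLACE INTO sph_counter SELECT 2, MAX(id) FROM items
    sql_query     = SELECT id, c_id, title, content FROM items \
                    WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
}

index items
{
    source = items
    path   = /usr/local/sphinx/var/data/items
    # tokenizer settings (charset, n-gram options) omitted in this sketch
}

index items_delta : items
{
    source = items_delta
    path   = /usr/local/sphinx/var/data/items_delta
}

index master
{
    type  = distributed
    local = items
    local = items_delta
}

indexer
{
    mem_limit = 64M
}

searchd
{
    listen   = 9312
    log      = /usr/local/sphinx/var/log/searchd.log
    pid_file = /usr/local/sphinx/var/log/searchd.pid
}
EOF

echo "sketch written to $conf"
```

Counter 1 marks where the main index ends and counter 2 where the latest delta run ended, which is why a merge script can copy counter 2 into counter 1 after a successful merge.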

##Rebuild the indexes:

/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --all
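Error:
FATAL: out of memory (unable to allocate 235667448 bytes)
This message means that not enough memory is available in the current environment: either other programs are using so much that too little is left, or mem_limit in the configuration is set too high.
The quickest fix:
edit sphinx.conf and lower the mem_limit value.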
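
The fix comes down to changing one line in the indexer section. A small sketch of that edit, run against a throwaway file rather than the real sphinx.conf; the 256M and 64M values are arbitrary examples, not recommendations:

```shell
#!/bin/bash
# Demo on a scratch file; the values are illustrative assumptions.
conf="${TMPDIR:-/tmp}/mem_limit_demo.conf"
printf 'indexer\n{\n    mem_limit = 256M\n}\n' > "$conf"

# Rewrite only the value after "mem_limit =", keeping the indentation
sed 's/^\([[:space:]]*mem_limit[[:space:]]*=\).*/\1 64M/' "$conf" > "$conf.new"

cat "$conf.new"
```

Writing to a new file first makes it easy to diff the result before replacing the original.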
Start Sphinx:
/usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx.conf
Check the process:
ps -ef |grep searchd
root     18294     1  0 15:16 ?        00:00:00 /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx.conf
root     18295 18294  0 15:16 ?        00:00:00 /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx.conf
root     18304 17955  0 15:17 pts/0    00:00:00 grep searchd
Stop searchd:
/usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx.conf --stop
Check searchd status:
/usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx.conf --status

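Index updates and usage

The delta index is rebuilt every N minutes. The two indexes are usually merged once a night during low load, and the delta index is rebuilt at the same time. If the main index does not hold much data, you can simply rebuild the main index instead.
When searching through the API, use both the main index and the delta index to get near-real-time results. The Sphinx configuration in this post puts both into the distributed index "master", so querying "master" alone returns all matching data, including the newest.
Index updates and merges can be handled by cron jobs:

crontab -e
*/1 * * * * /usr/local/sphinx/shell/delta_index_update.sh
0 3 * * * /usr/local/sphinx/shell/merge_daily_index.sh
crontab -l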

#delta_index_update.sh:
#!/bin/bash  
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta > /dev/null 2>&1

#merge_daily_index.sh:
#!/bin/bash
indexer=$(which indexer)
mysql=$(which mysql)
QUERY="use sphinx_items;select max_doc_id from sph_counter where counter_id = 2 limit 1;"
index_counter=$($mysql -h192.168.1.198 -uroot -p123456 -sN -e "$QUERY")
# merge "main + delta" indexes
$indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate --merge items items_delta --merge-dst-range deleted 0 0 >> /usr/local/sphinx/var/index_merge.log 2>&1
if [ "$?" -eq 0 ]; then
    ## update sphinx counter
    if [ -n "$index_counter" ]; then
        $mysql -h192.168.1.198 -uroot -p123456 -Dsphinx_items -e "REPLACE INTO sph_counter VALUES (1, '$index_counter')"
    fi
    ## rebuild delta index to avoid confusion with main index
    $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta >> /usr/local/sphinx/var/rebuild_deltaindex.log 2>&1
fi

Installing SCWS Chinese word segmentation for PHP: make sure the extension version matches your PHP version

wget -c http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2  
tar jxvf scws-1.2.3.tar.bz2  
cd scws-1.2.3  
./configure --prefix=/usr/local/scws  
make && make install

## Append to the end of php.ini, and remember to restart the web server
[scws] 
extension_dir=/www/wdlinux/apache_php-7.0.6/lib/php/extensions/no-debug-non-zts-20151012/
extension = scws.so 
scws.default.charset = utf8 
scws.default.fpath = /usr/local/scws/etc
    
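## For other Sphinx API methods, see http://www.php.net/manual/zh/book.sphinx.php
PHP word segmentation functions
/**
 * [scwsWord  SCWS word segmentation]
 * @Author   Jerry
 * @DateTime 2018-05-22T17:19:57+0800
 * @Example  eg:
 * @param    string  $word    [keyword string to segment]
 * @param    bool    $detail  [true to return the full SCWS result entries instead of only the words]
 * @return   array            [list of words, or the full result entries when $detail is true]
 */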
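function scwsWord($word,$detail=false){
    // create the SCWS segmenter object
    $so = scws_new();
    // set the character encoding used for segmentation
    $so->set_charset('utf-8');
    // set the dictionary (the UTF-8 dictionary here)
    $so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
    // set the rule file (the stray trailing space in the path has been removed)
    $so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
    // drop punctuation from the results
    $so->set_ignore(true);
    // compound splitting: e.g. "中国人" yields the three words "中国" + "人" + "中国人"
    $so->set_multi(true);
    // automatically combine stray single characters into two-character words
    $so->set_duality(true);
    // the text to segment
    $so->send_text($word);
    // fetch the results; to extract high-frequency words, use get_tops() instead
    $wordArr = [];
    $wordList = [];
    while ($tmp = $so->get_result())
    {
        // each get_result() call returns one batch of entries
        $wordArr[] = $tmp;
        foreach ($tmp as $v) {
            $wordList[] = $v['word'];
        }
    }
    $so->close();
    return $detail ? $wordArr : $wordList;
}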
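/**
 * [sphinx  Sphinx search helper]
 * @Author   Jerry
 * @DateTime 2018-05-24T11:53:53+0800
 * @Example  eg:
 * @param    string   $word    [the search terms]
 * @param    string   $index   [Sphinx index name]
 * @param    string   $ip      [IP of the searchd service]
 * @param    integer  $port    [port of the searchd service]
 * @param    bool     $detail  [true to return the raw Query() result]
 * @return   mixed             [comma-separated id list, or the raw result when $detail is true]
 */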
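function sphinx($word,$index='index_blog',$ip='127.0.0.1',$port=9312,$detail=false){
    $sc = new \SphinxClient();
    // use the $ip and $port arguments instead of hard-coded values
    $sc->SetServer($ip,$port);
    // SPH_MATCH_ALL: match all query words (default); SPH_MATCH_ANY: match any of the query words; SPH_MATCH_EXTENDED2: supports queries with special operators
    // $sc->SetMatchMode(SPH_MATCH_ALL);
    $sc->SetMatchMode(SPH_MATCH_EXTENDED);
    $sc->SetArrayResult(TRUE);
    $re = $sc->Query($word,$index);
    if($detail) return $re;
    $ids = [0];## keeps an SQL IN (...) clause valid when nothing matched
    // Query() returns false on failure, and 'matches' may be missing or empty;
    // the original "count(...) > 1" test also skipped a single match
    if($re !== false && !empty($re['matches'])){
        foreach ($re['matches'] as $v) {
            $ids[] = $v['id'];
        }
    }
    return implode(",", $ids);
}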
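/**
 * [sphinxScwsWord  combined use: segment with SCWS, then search with Sphinx]
 * @Author   Jerry
 * @DateTime 2018-05-28T16:08:33+0800
 * @Example  eg:
 * @return   array  [rows with 'href' and 'title'; an empty array when nothing matches]
 */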
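function sphinxScwsWord($keyword,$sort='@id desc',$searchFromField='@title',$field='c_id,id,title',$index='*',$ip='127.0.0.1',$MatchMode = SPH_MATCH_FULLSCAN,$port=9312){
    // build the query from the segmented words: (w1)|(w2)|...
    $words_array = scwsWord($keyword);
    $words = "";
    foreach($words_array as $v)
    {
        $words = $words.'|('.$v.')';
    }
    $words = trim($words,'|');
    $sc = new \SphinxClient();
    $sc->SetServer($ip,$port);
    // SPH_MATCH_ALL: match all query words (default mode)
    // SPH_MATCH_ANY: match any of the query words
    // SPH_MATCH_PHRASE: treat the whole query as a phrase that must match in order
    // SPH_MATCH_BOOLEAN: treat the query as a boolean expression
    // SPH_MATCH_EXTENDED: treat the query as an expression in Sphinx's internal query language
    // SPH_MATCH_FULLSCAN: do a full scan and ignore the query words
    // SPH_MATCH_EXTENDED2: like SPH_MATCH_EXTENDED, with ranking and weighting support
    // The $MatchMode argument is not used; the mode is fixed here. The "@field" and "|"
    // operators built above are only interpreted as operators in the extended modes.
    $sc->SetMatchMode(SPH_MATCH_ANY);  //SPH_MATCH_EXTENDED
    $sc->SetArrayResult(TRUE);
    $sc->SetSortMode(SPH_SORT_EXTENDED, $sort);## sort order
    $sc->SetSelect($field);## columns to return
    // "." binds tighter than "?:", so the original unparenthesized expression
    // sent only $searchFromField and dropped $words
    $res = $sc->Query(($searchFromField ? $searchFromField.' ' : '').$words,$index);
    // debug output; show() is a helper that is not defined in this listing
    // show($res);
    // show($words);
    // show($searchFromField);
    // show($words_array);
    // Query() returns false on failure; "count(...) > 1" also skipped a single match
    if($res !== false && !empty($res['matches'])){
        $data = [];
        foreach ($res['matches'] as $k => $v) {
            ## assemble the row; encode() is likewise not defined in this listing
            $data[$k]['href'] = '/b_d@'.encode($v['attrs']['c_id']).'@'.$v['id'].'.html';
            $data[$k]['title'] = $v['attrs']['title'];
            ## highlight the segmented words
            foreach ($words_array as $w) {
                // as published, the replacement equals the search string, so this is a no-op;
                // wrap {$w} in whatever highlight markup you need
                $data[$k]['title'] = str_replace($w, "{$w}", $data[$k]['title']);
            }
        }
        return $data;
    }else{
        return [];
    }
}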