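It turns out that a lot of what you find online is nonsense, even though there is a sound principle underneath. I ran into a requirement: given a piece of GBK-encoded text, strip out every irrelevant symbol, including punctuation, Japanese katakana and the like. So I searched and searched, and what I found was either wrong or completely beside the point. The most widely copied answer is http://community.mybbchina.net/thread-212-post-507.html (poor author). I have no idea why people keep copying it: I never got it to work in my tests, and it does not even get the GBK character ranges right.
Matching Chinese text in a known encoding is actually simple. The GBK code table assigns the symbols well-defined positions, and they happen to form one contiguous block (lead bytes 0xA1 through 0xA9, the range the pattern below uses); Wikipedia describes the layout in detail. For example, suppose I want to take a string like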
addd中华ds,#¥%…&((212))}}A■g民国ds中-啊国fjsd,【】啊Y
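and remove every character that is not Chinese, an English letter or a digit. Let's call the removed characters "Martian text" (火星文). In PHP, removing Martian text looks like this: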
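// Walk the string one GBK character at a time (\G keeps every match on a character
// boundary) and drop each double-byte character whose lead byte is 0xA1-0xA9.
// A lazy "(...)*?" would match the empty string everywhere and remove nothing.
$str = preg_replace('/\G((?:[\x00-\x80\xFF]|[\x81-\xA0\xAA-\xFE][\x40-\xFE])*)[\xA1-\xA9][\x40-\xFE]/', '$1', $str);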
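This replaces every symbol typed in full-width mode. Note that full-width letters and digits sit in the same block (lead byte 0xA3), so they are removed too. If you also need to handle half-width (ASCII) symbols, check the ASCII code values; or, if that is too much trouble, list every character explicitly and replace them. The clear_point function below takes the second route. I found it online, did not write it, and have forgotten where it came from.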
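The "check the ASCII code value" route can be sketched roughly as follows. This is not code from the original post: the function name strip_mars_text and its $stripAscii flag are made up for illustration, and the sketch assumes the input is valid GBK. It consumes the string one character at a time, so a trail byte is never mistaken for a symbol lead byte or for ASCII punctuation, then drops double-byte characters whose lead byte is 0xA1-0xA9 and, optionally, single bytes whose ASCII value is a space or a punctuation mark.

```php
<?php
/**
 * Hypothetical helper (name and flag are illustrative, not from the post).
 * Removes GBK symbol-block characters and, optionally, ASCII punctuation
 * and spaces from a GBK-encoded byte string.
 */
function strip_mars_text($gbk, $stripAscii = true)
{
    return preg_replace_callback(
        // One character per match: a double-byte GBK character if possible,
        // otherwise a single byte. Matching therefore stays on boundaries.
        '/[\x81-\xFE][\x40-\xFE]|./s',
        function ($m) use ($stripAscii) {
            $ch = $m[0];
            if (strlen($ch) === 2) {
                $lead = ord($ch[0]);
                // Lead bytes 0xA1-0xA9: punctuation, kana, full-width forms...
                return ($lead >= 0xA1 && $lead <= 0xA9) ? '' : $ch;
            }
            if (!$stripAscii) {
                return $ch;
            }
            // Single byte: test the ASCII value against the punctuation ranges.
            $o = ord($ch);
            $isPunct = ($o >= 0x20 && $o <= 0x2F) || ($o >= 0x3A && $o <= 0x40)
                    || ($o >= 0x5B && $o <= 0x60) || ($o >= 0x7B && $o <= 0x7E);
            return $isPunct ? '' : $ch;
        },
        $gbk
    );
}

// GBK bytes for: "a," + 中 + full-width comma + 啊国 + "1!"
$in = "a," . "\xD6\xD0" . "\xA3\xAC" . "\xB0\xA1\xB9\xFA" . "1!";
echo bin2hex(strip_mars_text($in)), "\n"; // 61d6d0b0a1b9fa31
```

The character-by-character walk matters because GBK trail bytes range over 0x40-0xFE: the trail byte of 啊 (0xB0 0xA1) looks like a symbol lead byte, and trail bytes below 0x80 overlap ASCII punctuation such as @ and [. A plain byte-wise replacement can hit those bytes and corrupt the neighbouring characters.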
/**
 * Strip all common punctuation marks.
 *
 * @param string $pointer
 * @return string
 */
function clear_point($pointer)
{
    return str_replace
    (
        array(' ',"~","!","@","#","$","%","^","&","*",",",".","?",";",":","'",'"'," [",
        "]","{","}","!"," ¥","……","…","、",",","。","?",";",":","‘","“","”",
        "’"," 【","】","~","!","@","#","$","%","^","&","*",",","."," <",
[...]