Matching Chinese punctuation with regular expressions (GBK encoding)

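As it turns out, a lot of what you find online is nonsense, although the underlying principle is sound. I ran into a requirement: given a piece of GBK-encoded text, strip out every symbol and irrelevant character, including punctuation, Japanese katakana and the like. So I searched and searched online, and the results were either wrong or completely irrelevant. The most widely copied one is http://community.mybbchina.net/thread-212-post-507.html (poor author). I have no idea why people keep copying it: I never got it to work in testing, and it does not even get the GBK character ranges right.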

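Matching Chinese against a specific character encoding is actually simple. The GBK code table clearly defines where every symbol lives, and the symbols happen to fall into one contiguous range (lead bytes 0xA1 to 0xA9). Wikipedia has a detailed description. For example, suppose I want to take a string like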

addd中华ds,#¥%…&((212))}}A■g民国ds中-啊国fjsd,【】啊Y

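and remove every character that is not Chinese, English or a digit. Let's call the characters to be removed "Martian text" for now. The PHP code to strip Martian text is: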

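// Match on whole-character boundaries: a bare [\xA1-\xA9]. can start on a trail byte (啊 is B0 A1) and corrupt the text.
$str = preg_replace('/\G((?:[\x00-\x7F]|[\x81-\xA0\xAA-\xFE][\x40-\xFE])*+)(?:[\xA1-\xA9][\x40-\xFE])+/', '$1', $str);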
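One pitfall with byte-level patterns on GBK text is that a trail byte can fall in the same 0xA1 to 0xA9 range as a symbol's lead byte (啊 is 0xB0 0xA1), so it is safer to step through the string one character at a time. The following is a minimal sketch of that idea, not a definitive implementation. It assumes the input is well-formed GBK, and the function name `strip_gbk_symbols` is made up for illustration. The sample input is written with byte escapes so that it does not depend on the encoding the source file is saved in.

```php
<?php
/**
 * Remove every two-byte GBK character whose lead byte is 0xA1-0xA9.
 * Hypothetical helper; assumes $gbk is well-formed GBK.
 */
function strip_gbk_symbols(string $gbk): string
{
    $out = '';
    $len = strlen($gbk);
    for ($i = 0; $i < $len; ) {
        $lead = ord($gbk[$i]);
        if ($lead < 0x81 || $i + 1 >= $len) {
            // Single-byte character (ASCII), or a dangling final byte: keep it.
            $out .= $gbk[$i];
            $i += 1;
            continue;
        }
        // Two-byte character: keep it unless it sits in the symbol rows.
        if ($lead < 0xA1 || $lead > 0xA9) {
            $out .= substr($gbk, $i, 2);
        }
        $i += 2;
    }
    return $out;
}

// "a" + full-width comma (A3 AC) + 啊 (B0 A1) + 【 (A1 BE) + "Y"
$input = "a\xA3\xAC\xB0\xA1\xA1\xBEY";
echo bin2hex(strip_gbk_symbols($input)), "\n"; // prints 61b0a159
```

Because the loop always advances by a whole character, the 0xA1 trail byte of 啊 is never tested as if it were a lead byte.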

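This replaces every non-Chinese, non-English, non-digit character typed in full-width mode. If you also need to handle half-width (ASCII) symbols, check the ASCII code values; or, if that is too much trouble, list them all out and replace them. The function below is one reference. I found it online, did not write it, and have forgotten the source.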

	/**
	 * Remove all common punctuation marks.
	 * The symbol literals must be saved in the same encoding as the input (GBK here).
	 * @param string $pointer text to clean
	 * @return string text with the listed symbols removed
	 */
	function clear_point($pointer)
	{
	     // A single '' replacement applies to every search string.
	     return str_replace
	       (
	       array(' ',"~","!","@","#","$","%","^","&","*",",",".","?",";",":","'",'"',"[",
	       "]","{","}","!","¥","……","…","、",",","。","?",";",":","‘","“","”",
	       "’","【","】","~","!","@","#","$","%","^","&","*",",",".","<",
	       ">",";",":","'",'"',"[","]","{","}","/","\\","《","》","-","_"),
	       '',
	       $pointer
	       );
	}

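There is a better way to write this, and the full-width symbols were already filtered out by the earlier step, so they could be deleted here. I'm lazy, so take this as a starting point for something better.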
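As one possible "better way", and assuming the goal is to keep only ASCII letters, ASCII digits and two-byte characters outside the symbol rows, both passes can be folded into a single character-by-character walk that checks ASCII code values directly. This sketch also sidesteps a hazard of byte-wise replacement on GBK: trail bytes in 0x40 to 0x7E overlap ASCII punctuation such as @, [ and \, so replacing those bytes blindly can break a character. The function name `keep_text_only` is hypothetical, and the code assumes well-formed GBK input.

```php
<?php
/**
 * Keep ASCII letters and digits plus two-byte GBK characters outside the
 * symbol rows (lead byte 0xA1-0xA9); drop everything else.
 * Hypothetical helper; assumes $gbk is well-formed GBK.
 */
function keep_text_only(string $gbk): string
{
    $out = '';
    $len = strlen($gbk);
    for ($i = 0; $i < $len; ) {
        $byte = ord($gbk[$i]);
        if ($byte < 0x81 || $i + 1 >= $len) {
            // Single byte: keep it only if it is an ASCII digit or letter.
            $isAlnum = ($byte >= 0x30 && $byte <= 0x39)
                || ($byte >= 0x41 && $byte <= 0x5A)
                || ($byte >= 0x61 && $byte <= 0x7A);
            if ($isAlnum) {
                $out .= $gbk[$i];
            }
            $i += 1;
            continue;
        }
        // Two bytes: the trail byte may look like ASCII punctuation,
        // so it is copied with its lead byte and never tested alone.
        if ($byte < 0xA1 || $byte > 0xA9) {
            $out .= substr($gbk, $i, 2);
        }
        $i += 2;
    }
    return $out;
}

// "a-" + 中 (D6 D0) + "," + 丂 (81 40, trail byte equals "@") + "!1"
$input = "a-\xD6\xD0,\x81\x40!1";
echo bin2hex(keep_text_only($input)), "\n"; // prints 61d6d0814031
```

The ASCII ranges are spelled out as numeric comparisons so that the result does not depend on the process locale.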


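1 comment on "Matching Chinese punctuation with regular expressions (GBK encoding)"

  1. 则名 says:

    Uh, for what it's worth, I just listed out every Chinese symbol I needed to match. A pretty clumsy approach.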
