Posted by & filed under Excellence Article.

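I know this question is very broad … and I know it can't be explained in a sentence or two …

But I still want to understand how a search engine gets its results …

Say I have countless crawlers that have fetched ten million, i.e. 10M, pieces of text …

Now I want to pull out of those 10M texts the ones that contain a specific string …

Forget word segmentation and everything else … just return every text where strpos finds the given string …

How does a search engine do that ..?

With a typical search engine, results come back almost instantly after you type in a keyword …

But how does it actually work behind the scenes ..?

Because what will be searched for is unknown in advance … a keyword index obviously won't work …

So does it walk through every stored file …?

Would that really be more efficient than MATCH AGAINST ..?

Besides, even when there is no matching record at all, the search engine reports Not Found very quickly …

If it scanned all 10M records in one pass …

data read speed and disk rotation speed would both be bottlenecks …

it could not possibly respond that fast …

Or perhaps it's a cluster of 1k servers … each server searching 10k records …

That takes time too …

Yet even under extremely high concurrency …

search speed doesn't seem to drop …

So how on earth do search engines return their results ..? – -#

This is from a discussion started by Sunyanzi here. It is also something I have wondered about for a long time, so I am backing it up here.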

Posted by & filed under Study & Reading.

The future of software development is about good craftsmen. With infrastructure like Amazon Web Services and an abundance of basic libraries, it no longer takes a village to build a good piece of software.

These days, a couple of engineers who know what they are doing can deliver complete systems. In this post, we discuss the top 10 concepts software engineers should know to achieve that.

A successful software engineer knows and uses design patterns, actively refactors code, writes unit tests and religiously seeks simplicity. Beyond the basic methods, there are concepts that good software engineers know about. These transcend programming languages and projects – they are not design patterns, but rather broad areas that you need to be familiar with. The top 10 concepts are:

  1. Interfaces
  2. Conventions and Templates
  3. Layering
  4. Algorithmic Complexity
  5. Hashing
  6. Caching
  7. Concurrency
  8. Cloud Computing
  9. Security
  10. Relational Databases

10. Relational Databases

Relational databases have recently been getting a bad name because they do not scale well to support massive web services. Yet they are one of the most fundamental achievements in computing: they have carried us for two decades and will remain for a long time. Relational databases are excellent for order management systems, corporate databases and P&L data.

At the core of the relational database is the concept of representing information in records. Each record is added to a table, which defines the type of information. The database offers a way to search the records using a query language, nowadays SQL. The database offers a way to correlate information from multiple tables.

The technique of data normalization is about correct ways of partitioning the data among tables to minimize data redundancy and maximize the speed of retrieval.
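As a rough illustration of records, tables, and correlating tables through a query, here is a minimal sketch using Python's built-in sqlite3 module. The customers/orders schema and the sample rows are hypothetical, chosen only to show one normalized split and one JOIN.

```python
import sqlite3

# Hypothetical schema: customers and their orders live in separate tables,
# so each customer's name is stored only once (normalization).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders (customer_id, total) VALUES (1, 20.0), (1, 5.5), (2, 12.0);
""")

# Correlate the two tables with a JOIN and aggregate per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
""").fetchall()
print(rows)
conn.close()
```

Each row in the result pairs a customer name with an order count and an order total, even though the name itself is stored only in the customers table.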

9. Security

With the rise of hacking and data sensitivity, security is paramount. Security is a broad topic that includes authentication, authorization, and information transmission.

Authentication is about verifying user identity. A typical website prompts for a password. The authentication typically happens over SSL (Secure Sockets Layer), which encrypts the information transmitted over HTTP. Authorization is about permissions and is important in corporate systems, particularly those that define workflows. The recently developed OAuth protocol lets users of a web service grant other applications access to their private information. This is how Flickr permits access to individual photos or data sets.

Another security area is network protection. This concerns operating systems, configuration and monitoring to thwart hackers. Not only the network is vulnerable; any piece of software is. Even the Firefox browser, marketed as the most secure, has to be patched continuously. Writing secure code for your system requires understanding the specifics and the potential problems.

8. Cloud Computing

In our recent post Reaching For The Sky Through Compute Clouds we talked about how commodity cloud computing is changing the way we deliver large-scale web applications. Massively parallel, cheap cloud computing reduces both costs and time to market.

Cloud computing grew out of parallel computing, the idea that many problems can be solved faster by running the computations in parallel.

After parallel algorithms came grid computing, which ran parallel computations on idle desktops. One of the first examples was the SETI@home project out of Berkeley, which used spare CPU cycles to crunch data coming from space. Grid computing is widely adopted by financial companies, which run massive risk calculations. The concept of under-utilized resources, together with the rise of the J2EE platform, gave rise to the precursor of cloud computing: application server virtualization. The idea was to run applications on demand and change what is available depending on the time of day and user activity.

Today’s most vivid example of cloud computing is Amazon Web Services, a package available via API. Amazon’s offering includes a cloud service (EC2), a storage service for storing and serving large media files (S3), a simple database and indexing service (SimpleDB), and a queue service (SQS). These first blocks already empower an unprecedented way of doing large-scale computing, and surely the best is yet to come.

7. Concurrency

Concurrency is one topic engineers notoriously get wrong, and understandably so: although the brain juggles many things at a time, schools emphasize linear thinking. Yet concurrency is important in any modern system.

Concurrency is about parallelism, but inside the application. Most modern languages have a built-in concept of concurrency; in Java, it is implemented using threads.

A classic concurrency example is the producer/consumer pattern, where the producer generates data or tasks and places them in a queue for worker threads to consume and execute. The complexity of concurrent programming stems from the fact that threads often need to operate on common data. Each thread has its own sequence of execution but accesses the same data. One of the most sophisticated concurrency libraries was developed by Doug Lea and is now part of core Java.
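The producer/consumer arrangement can be sketched as follows. This is a minimal sketch in Python rather than Java, assuming three worker threads, a `None` sentinel to stop them, and squaring a number as a stand-in for real work; all names are illustrative.

```python
import queue
import threading

tasks = queue.Queue()            # thread-safe hand-off between producer and workers
results = []                     # common data shared by all workers
results_lock = threading.Lock()  # guards the shared results list
SENTINEL = None                  # tells a worker there is no more work
NUM_WORKERS = 3

def producer(n):
    """Generate n tasks, then one stop marker per worker."""
    for i in range(n):
        tasks.put(i)
    for _ in range(NUM_WORKERS):
        tasks.put(SENTINEL)

def worker():
    """Consume tasks until the stop marker arrives."""
    while True:
        item = tasks.get()
        if item is SENTINEL:
            break
        with results_lock:       # only one thread mutates the list at a time
            results.append(item * item)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
producer(10)
for t in threads:
    t.join()
print(sorted(results))
```

The queue handles the hand-off safely on its own. The explicit lock marks the spot where the workers touch common data, which is where concurrency bugs tend to appear.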

6. Caching

No modern web system runs without a cache, which is an in-memory store that holds a subset of the information typically stored in the database. The need for a cache comes from the fact that generating results from the database is costly. For example, if you have a website that lists books that were popular last week, you’d want to compute this information once and place it into the cache. User requests fetch data from the cache instead of hitting the database and regenerating the same information.

Caching comes with a cost. Only some subsets of information can be stored in memory. The most common data pruning strategy is to evict the items that are least recently used (LRU). The pruning needs to be efficient so that it does not slow down the application.
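To make LRU eviction concrete, here is a minimal sketch built on Python's `collections.OrderedDict`. The `LRUCache` class and its capacity of two are illustrative assumptions, not how any particular caching product is implemented.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny in-memory cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used entry
cache.put("c", 3)    # over capacity, so "b" is evicted
```

Both `get` and `put` run in constant time, which keeps the pruning cheap.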

A lot of modern web applications, including Facebook, rely on a distributed caching system called Memcached, developed by Brad Fitzpatrick when working on LiveJournal. The idea was to create a caching system that utilises spare memory capacity on the network. Today, there are Memcached libraries for many popular languages, including Java and PHP.

5. Hashing

The idea behind hashing is fast access to data. If the data is stored sequentially, the time to find an item is proportional to the size of the list. With hashing, a hash function calculates a number for each element, and that number is used as an index into a table. Given a good hash function that spreads data uniformly across the table, the look-up time is constant. Perfect hashing is difficult to achieve, so hashtable implementations support collision resolution.
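A small sketch may help here: a hash table that resolves collisions by chaining, i.e. keeping a short list per bucket. The `ChainedHashTable` class is hypothetical and deliberately tiny; real implementations also resize the table as it fills.

```python
class ChainedHashTable:
    """Hash table where each bucket holds a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash value, reduced modulo the table size, is the index.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # new key (possibly a collision): chain it

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

table = ChainedHashTable(num_buckets=2)  # a tiny table forces collisions
for word in ["apple", "banana", "cherry"]:
    table.put(word, len(word))
print(table.get("banana"))
```

With only two buckets, at least two of the three keys share a bucket, yet every look-up still finds the right value.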

Beyond the basic storage of data, hashes are also important in distributed systems. The so-called uniform hash is used to evenly allocate data among computers in a cloud database. A flavor of this technique is part of Google’s indexing service; each URL is hashed to a particular computer. Memcached similarly uses a hash function.
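As a sketch of hashing keys to machines, the snippet below maps URLs onto a fixed number of servers. The function `server_for`, the use of MD5, and the example.com URLs are assumptions for illustration only; this is not a description of how Google or Memcached actually assign keys.

```python
import hashlib

def server_for(key, num_servers):
    """Map a key to a server index in the range [0, num_servers)."""
    # A stable digest is used instead of Python's built-in hash(), which is
    # randomized per process for strings and so differs between machines.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

urls = [f"http://example.com/page/{i}" for i in range(1000)]
counts = [0] * 4
for url in urls:
    counts[server_for(url, 4)] += 1
print(counts)   # how many of the 1000 URLs landed on each of the 4 servers
```

One design caveat: with plain modulo assignment, changing the number of servers remaps most keys, which is why large systems often use more elaborate schemes.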

Hash functions can be complex and sophisticated, but modern libraries have good defaults. The important thing is to understand how hashes work and how to tune them for maximum performance benefit.

4. Algorithmic Complexity

There are just a handful of things engineers must know about algorithmic complexity. First is big O notation. If something takes O(n), it is linear in the size of the data; O(n^2) is quadratic. Using this notation, you should know that searching through a list is O(n) and binary search (through a sorted list) is O(log n). Sorting n items takes O(n log n) time.
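The difference between O(n) and O(log n) search can be sketched with two small functions. The helper names and the sample data are illustrative; `bisect` is Python's standard binary-search module.

```python
import bisect

def linear_search(items, target):
    """O(n): look at each item in turn until the target is found."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """O(log n): halve the remaining search range at every step."""
    i = bisect.bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = sorted([42, 7, 19, 3, 88, 61])   # sorting itself costs O(n log n)
print(linear_search(data, 61), binary_search(data, 61))
```

Both functions return the same index, but on a million sorted items the linear version may inspect every element while the binary version needs about twenty comparisons.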

Your code should (almost) never have multiple nested loops (a loop inside a loop inside a loop). Most of the code written today should use Hashtables, simple lists and singly nested loops.

Due to the abundance of excellent libraries, we are not as focused on efficiency these days. That’s fine, as tuning can happen later on, after you get the design right.

Elegant algorithms and performance are something you shouldn’t ignore. Writing compact and readable code helps ensure your algorithms are clean and simple.

3. Layering

Layering is probably the simplest way to discuss software architecture. It first got serious attention when John Lakos published his book Large-Scale C++ Software Design. Lakos argued that software consists of layers, and the book introduced a way to measure them: for each software component, count the number of other components it relies on. That count is the metric of how complex the component is.

Lakos contended that good software follows the shape of a pyramid; i.e., there is a progressive increase in the cumulative complexity of each component, but not in the immediate complexity. Put differently, a good software system consists of small, reusable building blocks, each carrying its own responsibility. In a good system, no cyclic dependencies between components are present and the whole system is a stack of layers of functionality, forming a pyramid.
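The dependency-counting idea can be sketched in a few lines. The component graph below is hypothetical, and `cumulative_deps` is only one simple reading of "immediate" versus "cumulative" complexity, not Lakos's exact metric.

```python
# Hypothetical component graph: each component maps to its direct dependencies.
deps = {
    "app": ["orders", "users"],
    "orders": ["db", "users"],
    "users": ["db"],
    "db": [],
}

def cumulative_deps(component, graph, path=()):
    """Return every component reachable from `component`; fail on a cycle."""
    if component in path:
        raise ValueError("cycle: " + " -> ".join(path + (component,)))
    found = set()
    for dep in graph[component]:
        found.add(dep)
        found |= cumulative_deps(dep, graph, path + (component,))
    return found

for name in deps:
    immediate = len(deps[name])
    cumulative = len(cumulative_deps(name, deps))
    print(f"{name}: immediate={immediate}, cumulative={cumulative}")
```

In this graph "app" sits at the top of the pyramid with two immediate and three cumulative dependencies, while "db" at the bottom has none. A cyclic graph makes the function raise instead of returning a count.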

Lakos’s work was a precursor to many developments in software engineering, most notably refactoring. The idea behind refactoring is to continuously sculpt the software to ensure it is structurally sound and flexible. Another major contribution came from Dr Robert Martin of Object Mentor, who wrote about dependencies and acyclic architectures.

Among the tools that help engineers deal with system architecture are Structure 101, developed by Headway Software, and SA4J, developed by my former company, Information Laboratory, and now available from IBM.

2. Conventions and Templates

Naming conventions and basic templates are the most overlooked software patterns, yet probably the most powerful.

Naming conventions enable software automation. For example, the JavaBeans framework is based on a simple naming convention for getters and setters. Likewise, canonical URLs in del.icio.us such as http://del.icio.us/tag/software take the user to the page that has all items tagged software.

Many social software products use naming conventions in a similar way. For example, if your user name is johnsmith, then your avatar is likely johnsmith.jpg and your RSS feed johnsmith.xml.

Naming conventions are also used in testing. For example, JUnit automatically recognizes all the methods in a class that start with the prefix test.
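Convention-based discovery of this kind can be sketched in a few lines of Python. The `run_tests` helper and the `CalculatorChecks` class are made up for illustration and mimic only the prefix rule, not JUnit itself.

```python
class CalculatorChecks:
    def test_addition(self):
        assert 1 + 1 == 2

    def test_subtraction(self):
        assert 3 - 1 == 2

    def helper(self):
        # Never run automatically: the name lacks the "test" prefix.
        raise RuntimeError("should never be called")

def run_tests(cls):
    """Call every method whose name starts with 'test'; return their names."""
    instance = cls()
    ran = []
    for name in sorted(dir(cls)):
        if name.startswith("test") and callable(getattr(instance, name)):
            getattr(instance, name)()
            ran.append(name)
    return ran

ran = run_tests(CalculatorChecks)
print(ran)
```

No registration or configuration is needed: naming a method correctly is enough for the runner to pick it up.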

The templates meant here are not C++ or Java language constructs. We’re talking about template files that contain variables and then allow binding of objects, resolution, and rendering the result for the client.

ColdFusion was one of the first to popularize templates for web applications. Java followed with JSPs, and recently Apache developed a handy general-purpose templating engine for Java called Velocity. PHP can be used as its own templating engine because it supports the eval function (be careful with security). For XML programming it is standard to use the XSL language for templates.

From generation of HTML pages to sending standardized support emails, templates are an essential helper in any modern software system.
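As a minimal sketch of binding variables into a template, here is a standardized support email rendered with Python's built-in `string.Template`. The wording of the email and the variable names are invented for the example.

```python
from string import Template

# The template text contains variables; values are bound at render time.
email = Template("Hello $name,\nYour ticket #$ticket_id is now $status.")

message = email.substitute(name="Ada", ticket_id=1024, status="resolved")
print(message)
```

The same template can be rendered for every customer by binding different values, which keeps the wording in one place.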

1. Interfaces

The most important concept in software is the interface. Any good software is a model of a real (or imaginary) system. Understanding how to model the problem in terms of correct and simple interfaces is crucial. Lots of systems suffer from the extremes: clumped, lengthy code with little abstraction, or an overly designed system with unnecessary complexity and unused code.

Among the many books, Agile Programming by Dr Robert Martin stands out because of its focus on modeling correct interfaces.

In modeling, there are ways you can iterate towards the right solution. Firstly, never add methods that might be useful in the future. Be minimalist, get away with as little as possible. Secondly, don’t be afraid to recognize today that what you did yesterday wasn’t right. Be willing to change things. Thirdly, be patient and enjoy the process. Ultimately you will arrive at a system that feels right. Until then, keep iterating and don’t settle.
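A minimalist interface might look like the sketch below, assuming Python 3.8+ for `typing.Protocol`. The `KeyValueStore` interface, the `InMemoryStore` class, and `remember_visit` are hypothetical names chosen for illustration.

```python
from typing import Optional, Protocol

class KeyValueStore(Protocol):
    """Only the two methods callers need today; nothing 'for the future'."""
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...

class InMemoryStore:
    """One possible implementation; a database-backed one could replace it."""
    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

def remember_visit(store: KeyValueStore, user: str) -> int:
    """Depends only on the interface, not on any concrete store."""
    count = int(store.get(user) or "0") + 1
    store.put(user, str(count))
    return count

store = InMemoryStore()
print(remember_visit(store, "ada"))
```

Because `remember_visit` knows only the interface, swapping the storage later means changing one class, not every caller.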

Conclusion

Modern software engineering is sophisticated and powerful, with decades of experience, millions of lines of supporting code and unprecedented access to cloud computing. Today, just a couple of smart people can create software that previously required the efforts of dozens of people. But a good craftsman still needs to know what tools to use, when and why.

In this post we discussed concepts that are indispensable for software engineers. And now tell us please what you would add to this list. Share with us what concepts you find indispensable in your daily software engineering journeys.

Posted by & filed under Programming.

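1. Don't rush; first understand what the HTTP protocol is.

2. Next, look at HTML and CSS until you can use them for everyday work.

3. Now it is time to learn PHP. Mastering every detail is impossible, but with the PHP manual at hand you should at least be able to find the function (or method) you need and use it correctly.

4. Even the simplest application, even a notepad program, involves a database (you can store data in text files or some other way, but nothing is as convenient and powerful as a database). So at this step, read up on basic SQL syntax until you can use it.

5. Find a good forum and visit it often. Ask when you have questions, and of course answer other people's too: it helps them and helps you sort out your own thinking, and someone else's problem may well land in front of you next time.

6. Learn some XML and try manipulating it with PHP. You may not use it often, but it is very important in many situations.

7. Find a Linux distribution you like and play with it (I recommend FreeBSD, although strictly speaking it is a BSD rather than a Linux distribution). Get familiar with basic environment configuration, using an editor, simple commands and so on. If you interview for a PHP developer position, I guarantee the interviewer will ask Linux questions.

8. OK, now you have the basics and need to level up: go study regular expressions. If you pick up a book on regular expressions and can understand and remember it all after a single read, congratulations, you are a genius or from Mars. If not, read it several times. Regular expressions will save you a lot of brain cells and time when processing text.

9. Next, study design patterns. You don't need many, and you only really understand these things once you use them, but you do need the most basic ones, such as MVC and Factory. Once you have them, look back at the programs you used to be so proud of: they are like weeds growing by a river, a tangled mess. Try refactoring programs you wrote earlier with the patterns you think fit. This helps a lot, as you will find out at interview time.

10. Before starting a huge project, I suggest you learn what a framework is. There are many open source frameworks; I recommend studying Zend Framework. I like it because its documentation is thorough enough that you can find an explanation for almost every problem you hit. Then read the book "Zend Framework in Action" and try building something you like with ZF. Reading all of ZF's source code would be best, of course; if you lack the time, I suggest picking one or a few modules to read and then building a small system with them. This helps a lot.

11. Right, if you have completed the ten items above, congratulations: you can pick a company you like and go interview. I recommend internet companies built on open source technology; they don't require degrees or experience and care only about your ability and potential. If you are confident enough, try some big companies such as eBay or Yahoo, or join us at Blogbus :p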

Posted by & filed under Operating System.

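I had always used an all-in-one bundle, so installing each program separately at work was real torture.

I went through a lot of tutorials that didn't work, and this one finally solved it for me.

1. Software to prepare:
Apache2.2 download: http://httpd.apache.org/download.cgi
PHP5.2     download: http://cn2.php.net/
2. Install and set the environment variable:
Install Apache2.2 to D:\Apache2.2
Unzip PHP5.2 to D:\php5.2
Under Environment Variables -> System variables, append ;D:\php5.2 to Path
3. Edit the configuration files:
Apache configuration:
Open the file D:\Apache2.2\conf\httpd.conf
Find the LoadModule block and add after it:
LoadModule php5_module "d:/php5.2/php5apache2_2.dll"
Find DocumentRoot and change the first <Directory></Directory> block after it to:
<Directory "D:/php5.2">
Options FollowSymLinks
AllowOverride None
Order deny,allow
Deny from all
Satisfy all
</Directory>
Find the ScriptAlias block and add after it:
ScriptAlias /php/ "D:/php5.2/"
Find the DirectoryIndex block and change it to:
DirectoryIndex index.php default.php index.html index.htm default.html default.htm
Find the AddType block and add after it:
AddType application/x-httpd-php .php
Action Application/x-httpd-php "D:/php5.2/php.exe"
PHP configuration:
Rename the file php.ini-recommended to php.ini, then open php.ini
Find extension_dir and change it to:
extension_dir = "d:\php5.2"
4. Test:
Create a new file test.php under D:\Apache2.2\htdocs: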
<HTML>
<HEAD>
<TITLE>
test
</TITLE>
</HEAD>
<BODY>
<H1>
First PHP page
</H1>
<HR>
<?php
// Single line C++ style comment
/*
printing the message
*/
echo "Hello World!";
# Unix style single line comment
?>
</BODY>
</HTML>
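Restart Apache so the configuration changes take effect, then enter http://localhost/test.php in the IE address bar. If Hello World! is displayed, the configuration succeeded.
Reposted from: http://hi.baidu.com/foollee/blog/item/76802451588c6e8a8c5430b7.html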

Posted by & filed under Tools.

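Ever since Notepad++ put "Beijing Olympic boycott" on its home page, I have stopped using it. After looking around, I found only a handful of editors suitable for writing code. Emacs is too powerful and feels awkward to use, and on Windows it also needs a Linux emulation layer, which makes it even less efficient. Intype does not support Chinese and is commercial (although it is free at this stage); it also has no documentation, so I don't know how to use its bundles, let alone edit them. EditPlus, forget it: its themes are too ugly. UltraEdit is the same: ugly! In the end I settled on Vim. Vim really is powerful. There are many features I don't know how to use, but fortunately it has very detailed help documentation. It is all in English, but I can make out at least a few sentences :)
After five months I have gotten used to Vim, but I was never satisfied with the built-in color schemes and had no time to learn to write my own, so I made do with the built-in evening scheme. Today I struck gold on the Ubuntu forum: 321 color schemes, and I finally found one I like :)

Click here to download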

Posted by & filed under Tools.

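I saw on cnBeta that Google Music Search had appeared, so I followed the URL. The author said it was a 404 page, but for me it actually loaded, redirecting to a different address. Apart from search, only the few links at the very bottom worked. The checkboxes could be ticked, but the preview button stayed gray, and the "Open player" button in the upper right did not work either. Does it not support Firefox? I tried IE6 and it worked. When I clicked to preview a song, the terms of use for 巨鲸网 (Top100.cn) appeared; after clicking "Agree", the screen below came up. I'll say no more; see for yourself, heh.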

Posted by & filed under Programming.

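When editing ASP.NET pages in Visual Studio 2005, you will sometimes see the error: "***" is not a known element. The error looks like the screenshot below:
[screenshot: error]
The cause may be a compilation error in the web site.
The "***" in the code may really be wrong, but sometimes the code is perfectly fine and the problem still appears. The symptom is that VS 2005 cannot recognize any tag correctly, yet the site compiles and the pages run normally. Although it does not affect the running program, once this happens VS 2005 becomes far less useful: in Source view all the code hints disappear.
This usually happens while editing a page that uses a Master Page. One possible fix: switch the problem page to Design view, right-click on the design surface and choose "Edit Master" to jump to the corresponding Master Page. You don't need to actually edit the Master Page; just go straight back to the problem page, and the tags are recognized again.
See the screenshot below:
[screenshot: solution]
This method is for reference only; each specific problem still needs its own analysis.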


Posted by & filed under Programming.

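I am no expert: I picked up a smattering of SQL a year ago, and apart from simple insert, delete, update and select statements I have forgotten the rest. Recently I ran into a problem:

There are three tables, A, B and C. A holds id, comboid and comboname; B holds id, comboid and sid; C holds sid and sname. Table B effectively links A and C. First you get a unique comboid from A, then use that comboid to look up sid in B, then use the sid to look up the corresponding sname in C. This is where the problem arises. If comboid and sid were in one-to-one correspondence, it would be easy: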

SELECT
 	sname
FROM c
	WHERE sid =
	(
		SELECT sid FROM b WHERE  comboid =
                         (SELECT comboid FROM a WHERE comboname=@comboname)
	);

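But in fact the relationship in B is many-to-many: one comboid corresponds to multiple sid values. Over the past two days I even wrote several super-complicated SQL queries, including ones using JOINs and merged tables, and ones that first built a view and then queried it.

In fact SQL has a built-in keyword, IN, which is extremely useful. Changing just one spot in the statement above makes the nested query handle a subquery that returns more than one row: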

SELECT
 	sname
FROM c
	WHERE sid IN
	(
		SELECT sid FROM b WHERE  comboid  =
                         (SELECT comboid FROM a WHERE comboname=@comboname)
	);

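One more note: this statement has only been tested on SQL Server 2005 and MySQL 5.0. It is just meant to suggest an approach, so adapt it to your own situation. If you have any advice, leave me a comment or mail me; I trust you can find my contact details, heh.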
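To try the difference between = and IN without a database server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption here (the post tested SQL Server 2005 and MySQL 5.0), and the sample rows are made up. Note one behavioral difference: with =, SQLite quietly uses only the first row of the subquery, whereas SQL Server and MySQL typically reject a multi-row scalar subquery with an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, comboid INTEGER, comboname TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, comboid INTEGER, sid INTEGER);
    CREATE TABLE c (sid INTEGER PRIMARY KEY, sname TEXT);
    -- Hypothetical data: combo 10 ('lunch') contains items 1 and 2.
    INSERT INTO a (comboid, comboname) VALUES (10, 'lunch');
    INSERT INTO b (comboid, sid) VALUES (10, 1), (10, 2);
    INSERT INTO c (sid, sname) VALUES (1, 'soup'), (2, 'rice'), (3, 'tea');
""")

# Same nested query as in the post; only the outer operator changes.
query = """
    SELECT sname FROM c
    WHERE sid {op} (
        SELECT sid FROM b WHERE comboid =
            (SELECT comboid FROM a WHERE comboname = ?)
    )
    ORDER BY sname
"""
with_in = [row[0] for row in conn.execute(query.format(op="IN"), ("lunch",))]
with_eq = [row[0] for row in conn.execute(query.format(op="="), ("lunch",))]
print("IN:", with_in)
print("= :", with_eq)
conn.close()
```

With IN, both items of the combo come back. With =, at most one does, which is the symptom described above.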

Posted by & filed under Study & Reading.

Function: outputs the blog's tag list

Standard syntax (using the tag cloud in my sidebar as an example):

<?php
 wp_tag_cloud();
?>

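Example with parameters:

<?php
 wp_tag_cloud('number=30&smallest=12&largest=12&unit=px');
?>

Parameters in detail:

* smallest: smallest font size for tag text, default 8pt;
* largest: largest font size for tag text, default 22pt;
* unit: unit for the tag font size, default pt; may be px, em, pt, a percentage, etc.;
* number: number of tags to output, default 45; set to "0" to output all tags;
* format: output format, one of "flat", "list" and "array"; the default "flat" lays the tags out inline, "list" outputs a list, and for "array" see the WordPress Codex;
* orderby: how the tags are ordered; the default "name" sorts by name, "count" sorts by the number of associated posts;
* order: sort direction; the default "ASC" is ascending, "DESC" is descending, "RAND" is random order.
* exclude: tags to leave out; give tag IDs separated by commas, e.g. "exclude=1,3,5,7" hides the tags with IDs 1, 3, 5 and 7;
* include: tags to include; used like exclude with the opposite effect, e.g. "include=2,4,6,8" shows only the tags with IDs 2, 4, 6 and 8.

Reference: http://codex.wordpress.org/Template_Tags/wp_tag_cloud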

Posted by & filed under Programming.

Before you can begin a transaction, you must first open the connection. You then begin the transaction, assign it to any newly created command objects, and run your queries as needed. Finally, commit the transaction. If an error occurs, roll back the transaction in a catch block to undo any changes, then rethrow the error so that the application can deal with it accordingly. The connection is closed in the finally block, which runs no matter what, and any unmanaged resources are released when the using statement calls Dispose() on the connection. A pretty simple solution to a fairly advanced topic.

The above template could actually implement a second C# using statement around command, because SqlCommand also implements IDisposable. I don’t know that it is really necessary, however; it is probably more theoretical than anything. I just like to see using statements around anything that implements IDisposable:

 
using (SqlConnection connection =
            new SqlConnection(connectionString))
{
    using (SqlCommand command =
            connection.CreateCommand())
    {
        SqlTransaction transaction = null;
 
        try
        {
            // BeginTransaction() Requires Open Connection
            connection.Open();
 
            transaction = connection.BeginTransaction();
 
            // Assign Transaction to Command
            command.Transaction = transaction;
 
            // Execute 1st Command
            command.CommandText = "Insert ...";
            command.ExecuteNonQuery();
 
            // Execute 2nd Command
            command.CommandText = "Update...";
            command.ExecuteNonQuery();
 
            transaction.Commit();
        }
        catch
        {
            // transaction is still null if Open() or BeginTransaction() threw
            if (transaction != null)
            {
                transaction.Rollback();
            }
            throw;
        }
        finally
        {
            connection.Close();
        }
    }
}