Archive for August 8th, 2007

You are currently browsing the archives of Enabling Technology.

WordPress Stats plug-in bsuite

http://www.maisonbisson.com/blog/post/10900/#section-3

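bsuite is a statistics plug-in for WordPress. Its predecessor was bstat, a simple, easy-to-use visitor statistics plug-in.
Its features are as follows:

  1. Tracks page hits.
  2. Tracks search engine keywords.
  3. Outputs the most-viewed posts.
  4. Outputs the most recent comments.
  5. Outputs the most frequent search keywords.
  6. Outputs a traffic pulse graph for the whole site or for a single post.
  7. Highlights search terms.
  8. Lists related posts at the bottom of each post.
  9. Integrates with bsuite_speedcache.
  10. Lists post tags.

Click here to download; the installation instructions are here.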

Posted by micas on Aug 8th 2007 | Filed in SEO | Comments (0)

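Baidu written test for job applicants

Baidu written test questions (reposted)

The questions were roughly as follows:

   Part 1, multiple choice: there were a few networking questions, all extremely easy; the first, for example, asked which of TCP, RIP, IP and FTP is a transport-layer protocol. There was one question on using chown on Linux. Everything else was data structures! Lists, tables, codes: I had no idea what any of it meant. Alas, I am someone who never studied data structures. How cruel! I guessed my way through this part quickly.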

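    Part 2, short-answer questions:

    1. On Linux, how do you compile a C program into an executable file? How do you debug it?

Answer: 1) Find the directory that holds the program's .h files and make sure the compiler searches it, for example with gcc's -I option (the shell's PATH is not used to locate headers);

         2) Compile C: # gcc [source file name] -o [output file name]

            Compile C++: # g++ [source file name] -o [output file name]

         3) Make the output file executable if it is not already (gcc normally sets the executable bit itself): # chmod +x [output file name]

         4) To run several executables one after another, you can write a batch file (a shell script):

             # vi [batch file name]

             executable 1

             executable 2

             ………

             Finally, change the batch file's permissions to make it executable.

    Debugging: compile with the -g option, and the program can then be debugged with gdb.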

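    2. Name the functions for allocating and freeing memory, and point out the differences.

Answer:

      The standard C memory allocation functions are malloc, calloc, realloc, free and so on.
      The difference between malloc and calloc is one block versus n blocks (calloc also zero-fills the memory it returns):
        malloc is called as (type*)malloc(size): it allocates one contiguous region of size bytes in dynamic storage and returns the address of its start.
       calloc is called as (type*)calloc(n,size): it allocates n contiguous blocks of size bytes each in dynamic storage and returns the start address.
       realloc is called as (type*)realloc(ptr,size): it resizes the block that ptr points to to size bytes, moving it if necessary.
       free is called as free(void*ptr): it releases the block of memory that ptr points to.
    In C++, the new and delete operators are used instead.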

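    3. List the socket functions and state what each one does.

       socket(): creates a socket descriptor for communication;
       bind(): associates the socket with a particular port on the machine;
      connect(): connects to a remote host;
      listen(): prepares the socket to accept connections and sets the length of the queue of pending requests;
      accept(): accepts a connection; once a client connects, accept returns the client's address information and a new socket;
   With this new socket, the two sides can start sending and receiving data:
      send() and recv(): used to communicate over stream sockets or connected datagram sockets;
      sendto() and recvfrom(): used with connectionless datagram sockets;
      close(): closes the socket;
      shutdown(): closes the socket selectively; it can shut down communication in one direction only;
      getpeername(): returns information about the peer at the other end of a stream socket;
      gethostname(): returns the host name of the machine the program is running on;
      gethostbyname(): looks up the IP address for a host name (for example, the local name returned by gethostname());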

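   Part 3, programming questions:

    1. Read string data from a file, display it in reverse order, and swap upper and lower case.

    2. Given the 26-letter alphabet and a corresponding cipher table, write a program that encrypts and decrypts.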
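One way to approach question 2 is a plain letter-for-letter substitution. The sketch below is only an illustration under assumptions of my own: the cipher table is taken to be a permutation of the 26 lowercase letters, and the class name `SubstitutionCipher` and the sample table are made up, not part of the test.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch for question 2: a letter-for-letter substitution cipher.
class SubstitutionCipher {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
    private final Map<Character, Character> encryptTable = new HashMap<>();
    private final Map<Character, Character> decryptTable = new HashMap<>();

    // cipherTable is assumed to be a permutation of the 26 lowercase letters.
    SubstitutionCipher(String cipherTable) {
        if (cipherTable.length() != ALPHABET.length()) {
            throw new IllegalArgumentException("cipher table must have 26 letters");
        }
        for (int i = 0; i < ALPHABET.length(); i++) {
            encryptTable.put(ALPHABET.charAt(i), cipherTable.charAt(i));
            decryptTable.put(cipherTable.charAt(i), ALPHABET.charAt(i));
        }
        if (decryptTable.size() != ALPHABET.length()) {
            throw new IllegalArgumentException("cipher table must not repeat letters");
        }
    }

    String encrypt(String plain) { return translate(plain, encryptTable); }

    String decrypt(String cipher) { return translate(cipher, decryptTable); }

    // Maps each letter through the table and keeps its case; other characters pass through.
    private static String translate(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            Character mapped = table.get(Character.toLowerCase(c));
            if (mapped == null) {
                out.append(c);
            } else {
                out.append(Character.isUpperCase(c) ? Character.toUpperCase(mapped) : mapped);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The table below is an arbitrary example (keyboard order).
        SubstitutionCipher cipher = new SubstitutionCipher("qwertyuiopasdfghjklzxcvbnm");
        String secret = cipher.encrypt("Hello, World!");
        System.out.println(secret);
        System.out.println(cipher.decrypt(secret));
    }
}
```

Building the decryption table at the same time as the encryption table keeps both directions a single map lookup per character.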

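  Part 4, open question (the legendary dictionary spelling-correction question):

     Users often make mistakes when typing English words, and these mistakes need to be corrected. Given a correct English dictionary, consider how to implement the correction. 1) Describe your approach. 2) Describe the process, how hard the algorithm is, and possible strategies for improving it.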

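Answer to an algorithm question

int Replace(Stringtype &S, Stringtype T, Stringtype V) // replace every substring T in string S with V and return the number of replacements
{
    Stringtype head, tail;
    int i, n;
    for (n = 0, i = 1; i <= Strlen(S) - Strlen(T) + 1; i++) // note the range of i (positions count from 1)
        if (!StrCompare(SubString(S, i, Strlen(T)), T)) // found a substring that matches T
        { // save the parts before and after T as head and tail
            StrAssign(head, SubString(S, 1, i - 1));
            StrAssign(tail, SubString(S, i + Strlen(T), Strlen(S) - i - Strlen(T) + 1));
            StrAssign(S, Concat(head, V));
            StrAssign(S, Concat(S, tail)); // join head, V and tail into the new string
            i += Strlen(V) - 1; // together with the loop's i++, move the pointer just past the inserted string
            n++;
        } // if
    return n;
} // Replace
Analysis: the statement i += Strlen(V) - 1; is required, and it is easy to overlook. Without it the routine still works in most cases, but in some it has unwanted consequences. Think about it: let S = 'place', T = 'ace', V = 'face'. What happens at run time if that statement is left out? (The 'ace' inside the newly inserted 'face' is matched again on the next pass, so 'face' is inserted without end.)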
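The same idea can be sketched in Java, where the skip past the inserted text is easy to see. This is an illustration only: the class and method names are my own, Java strings count from 0 rather than 1, and a while loop is used so that the index moves by exactly the length of the inserted string.

```java
// Hypothetical Java rendering of the replace-all routine discussed above.
class ReplaceAll {
    // Replaces every occurrence of t in s with v, scanning left to right.
    // Returns the new string; count[0] receives the number of replacements.
    static String replace(String s, String t, String v, int[] count) {
        if (t.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        StringBuilder sb = new StringBuilder(s);
        int n = 0;
        int i = 0;
        while (i + t.length() <= sb.length()) {
            if (sb.substring(i, i + t.length()).equals(t)) {
                sb.replace(i, i + t.length(), v);
                i += v.length(); // resume after the inserted text, never inside it
                n++;
            } else {
                i++;
            }
        }
        count[0] = n;
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] count = new int[1];
        System.out.println(replace("place", "ace", "face", count) + " " + count[0]);
    }
}
```

If the line `i += v.length()` were replaced by `i++`, the 'place' / 'ace' / 'face' case would keep matching the text it had just inserted and never terminate.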

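Baidu's 2005 written test

1. Implement void delete_char(char * str, char ch);

  Remove every occurrence of ch from str.

2. Replace every occurrence of substring A in string S with B. No function prototype was given for this one.

3. A search engine's log has to record every query string. There are ten million queries, of which no more than three million are distinct.

  Find the 10 most popular query strings. Memory < 1 GB. String length 0-255.

  (1) The main approach // the exact wording differs somewhat from the original question

  (2) The algorithm and an analysis of its complexity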
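A common approach to question 3 is to count each distinct query in a hash table and then keep the best 10 in a small min-heap. The sketch below shows the algorithm only; the class name `TopQueries` is made up, and whether three million strings really fit under 1 GB depends on their average length and on per-entry overhead, which this sketch does not try to control.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: hash-table counting followed by a size-k min-heap.
class TopQueries {
    // Returns the k most frequent queries, most frequent first.
    static List<String> topK(Iterable<String> queries, int k) {
        // Pass 1: count every distinct query. Expected O(N) for N log lines.
        Map<String, Integer> counts = new HashMap<>();
        for (String q : queries) {
            counts.merge(q, 1, Integer::sum);
        }
        // Pass 2: keep the k largest counts in a min-heap. O(M log k) for M distinct queries.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // drop the least frequent of the k+1 candidates
            }
        }
        // The heap yields the least frequent first, so reverse at the end.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topK(List.of("lucene", "java", "lucene", "baidu", "lucene", "java"), 2));
    }
}
```

With k fixed at 10 the heap step is effectively linear in the number of distinct queries, so the counting pass dominates. The order among queries with equal counts is not defined by this sketch.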


 
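4. Given a dictionary, design an English spelling-correction algorithm: (1) idea (2) algorithm and complexity (3) improvements

5. Given some sets of strings such as { aaa, bb, ccc, dd }, { bb, ff }, { gg }

  merge the sets whose intersection is not empty; the example above gives { aaa, bb, ccc, dd, ff }, { gg }

  (1) idea (2) algorithm and complexity (3) improvements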
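Question 5 is often answered with a union-find (disjoint-set) structure: remember which set first contained each string, and join two sets whenever a string turns up again. The sketch below is one possible answer, not the expected one; `SetMerger` and its method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: merge overlapping string sets with union-find.
class SetMerger {
    // Finds the representative of x, shortening the path as it goes.
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Merges all input sets that share a string, directly or through a chain of sets.
    static List<Set<String>> merge(List<Set<String>> sets) {
        int[] parent = new int[sets.size()];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        // Maps each string to the index of the first set that contained it.
        Map<String, Integer> firstOwner = new HashMap<>();
        for (int i = 0; i < sets.size(); i++) {
            for (String s : sets.get(i)) {
                Integer owner = firstOwner.putIfAbsent(s, i);
                if (owner != null) {
                    parent[find(parent, i)] = find(parent, owner); // union the two groups
                }
            }
        }
        // Collect the members of every group under its representative.
        Map<Integer, Set<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < sets.size(); i++) {
            groups.computeIfAbsent(find(parent, i), r -> new TreeSet<>()).addAll(sets.get(i));
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(Set.of("a", "b"), Set.of("b", "c"), Set.of("x"))));
    }
}
```

Every string is looked up once, so the running time is close to linear in the total number of strings across all sets.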

Posted by micas on Aug 8th 2007 | Filed in SEO | Comments (0)

What People are searching for?

http://searchenginewatch.com/showPage.html?page=2156041

Posted by micas on Aug 8th 2007 | Filed in SEO | Comments (0)

Did You Mean: Lucene?

All modern search engines attempt to detect and correct spelling errors in users’ search queries. Google, for example, was one of the first to offer such a facility, and today we barely notice when we are asked “Did you mean x?” after a slip on the keyboard. This article shows you one way of adding a “did you mean” suggestion facility to your own search applications using the Lucene Spell Checker, an extension written by Nicolas Maisonneuve and David Spencer.

 

Techniques of Spell Checking

Automatic spell checking has a long history. One important early paper was F. Damerau’s A Technique for Computer Detection and Correction of Spelling Errors, published in 1964, which introduced the idea of minimum edit distance. Briefly, the concept of edit distance quantifies the idea of one string being “close” to another, by counting the number of character edit operations (such as insertions, deletions and substitutions) that are needed to transform one string into the other. Using this metric, the best suggestions for a misspelling are those with the minimum edit distance.
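To make the idea of edit distance concrete, here is a minimal sketch of the classic dynamic-programming computation for insertions, deletions and substitutions (usually called Levenshtein distance). It is not taken from Lucene or from Damerau's paper; the class name is made up, and transpositions, which Damerau's formulation also covers, are left out.

```java
// Hypothetical sketch: edit distance with insertions, deletions and substitutions.
class EditDistance {
    // Returns the minimum number of single-character edits that turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous prefix of a
        int[] curr = new int[b.length() + 1]; // distances for the current prefix of a
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

Only two rows of the table are kept, so memory use grows with the length of one word rather than with the product of both lengths.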

Another approach is the similarity key technique, in which words are transformed into some sort of key so that similarly spelled and, hopefully, misspelled words have the same key. To correct a misspelling simply involves creating the key for the misspelling and looking up dictionary words with the same key for a list of suggestions. Soundex is the best-known similarity key, and is often used for phonetic applications.

A combination of minimum edit distance and similarity keys (metaphone) is at the heart of the successful strategy used by Aspell, the leading open source spell checker. However, it is a third approach that underlies the implementation of the “did you mean” technique described in this article: letter n-grams.

A letter n-gram is a sequence of n letters of a word. For instance, the word “lucene” can be divided into four 3-grams, also known as trigrams: “luc”, “uce”, “cen”, and “ene.” Why is it useful to break words up like this? The intuition is that misspellings typically only affect a few of the constituent n-grams, so we can recognize the intended word just by looking through correctly spelled words for those that share a high proportion of n-grams with the misspelled word. There are various ways of computing this similarity measure, but one powerful way is to treat it as a classic search engine problem with an inverted index of n-grams into words. This is precisely the approach taken by Lucene Spell Checker. Let’s see how to use it.
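As a small illustration of the idea (separate from the Lucene Spell Checker code itself), the following sketch splits a word into its letter n-grams and counts how many two words have in common. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: letter n-grams and a simple overlap count.
class NGrams {
    // Returns the letter n-grams of word in order; empty if the word is shorter than n.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Counts the distinct n-grams that the two words share.
    static int shared(String a, String b, int n) {
        Set<String> common = new HashSet<>(ngrams(a, n));
        common.retainAll(new HashSet<>(ngrams(b, n)));
        return common.size();
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3));
        System.out.println(shared("lucene", "lucen", 3));
    }
}
```

An inverted index does the same comparison in bulk: instead of testing one dictionary word at a time, it looks up every n-gram of the misspelling and collects the words that contain it.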

A Simple Search Application

We’ll first build a very simple search interface that does not include the “did you mean” facility. It defines a single method that takes a search query string and returns a search result.


package org.tiling.didyoumean;

import java.io.IOException;

import org.apache.lucene.queryParser.ParseException;

public interface SearchEngine {
    public SearchResult search(String queryString) throws IOException, ParseException;
}
  

The search result is a SearchResult object, which is a JavaBean that exposes a list of hits (actually just the top hits, for simplicity) and a few other properties. I have omitted the constructor and getters in the listing here as they are boilerplate code. (The full source code is available in the accompanying download–see the “References” section at the end of the article.)


package org.tiling.didyoumean;

import java.util.List;

public class SearchResult {

    private List topHits;
    private int totalHitCount;
    private long searchDuration;
    private String originalQuery;
    private String suggestedQuery;

}
  

Here’s a very simple implementation of SearchEngine built with Lucene. It uses Lucene’s QueryParser to parse the search query string into a Query that is then used to perform the search. The Lucene Hits object is then mapped to an instance of our SearchResult class.


package org.tiling.didyoumean;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class SimpleSearchEngine implements SearchEngine {

    private String defaultField;
    private String nameField;
    private Directory originalIndexDirectory;
    private int maxHits;

    public SimpleSearchEngine(String defaultField, String nameField,
            Directory originalIndexDirectory, int maxHits) {
        this.defaultField = defaultField;
        this.nameField = nameField;
        this.originalIndexDirectory = originalIndexDirectory;
        this.maxHits = maxHits;
    }

    public SearchResult search(String queryString) throws IOException, ParseException {
        long startTime = System.currentTimeMillis();
        IndexSearcher is = null;
        try {
            is = new IndexSearcher(originalIndexDirectory);
            QueryParser queryParser = new QueryParser(defaultField, new StandardAnalyzer());
            queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
            Query query = queryParser.parse(queryString);
            Hits hits = is.search(query);
            long endTime = System.currentTimeMillis();
            return new SearchResult(extractHits(hits), hits.length(), endTime - startTime, queryString);
        } finally {
            if (is != null) {
                is.close();
            }
        }
    }

    private List extractHits(Hits hits) throws IOException {
        List hitList = new ArrayList();
        for (int i = 0, count = 0; i < hits.length() && count++ < maxHits; i++) {
            hitList.add(hits.doc(i).getField(nameField).stringValue());
        }
        return hitList;
    }
}
  

Note that an IOException may be thrown by Lucene if there is a problem reading the index (typically from disk). The finally clause closes the IndexSearcher, but propagates the exception to indicate the problem to the client, which is the MVC layer, in this case.

With these ingredients it is straightforward to write a user interface that accepts user queries and presents the search results back to the user. I chose Spring’s MVC framework for this. Since this is an article about search and not about Spring, I won’t present any of the code for the user interface here–instead, please refer to the accompanying download.

Figure 1 is a screenshot of the search interface, running against an index of texts by Beatrix Potter from Project Gutenberg.

Figure 1. A simple search application

Adding “Did You Mean” to the Simple Search

Next we’ll extend the search to prompt with “did you mean” suggestions for misspelled search terms in the query. Let’s go through this step by step in the following subsections.

Generating a Spell Index

The first step is to generate an index from the original index that includes the letter n-grams for each word in the original index. I shall refer to this index as the spell index. With the help of the Lucene Spell Checker, this is very easy:


package org.tiling.didyoumean;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.Dictionary;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DidYouMeanIndexer {
    private static final String DEFAULT_FIELD = "contents";

    private static final String FIELD_OPTION = "f";
    private static final String ORIGINAL_INDEX_OPTION = "i";
    private static final String SPELL_INDEX_OPTION = "o";

    public void createSpellIndex(String field,
            Directory originalIndexDirectory,
            Directory spellIndexDirectory) throws IOException {

        IndexReader indexReader = null;
        try {
            indexReader = IndexReader.open(originalIndexDirectory);
            Dictionary dictionary = new LuceneDictionary(indexReader, field);
            SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
            spellChecker.indexDictionnary(dictionary);
        } finally {
            if (indexReader != null) {
                indexReader.close();
            }
        }
    }

}

  

The Dictionary interface specifies a single method:

public Iterator getWordsIterator();

that returns an iterator over the words in the dictionary. Here we use a LuceneDictionary object to read each word in the given field from the original index. We then create a SpellChecker, giving it a new index location to which to write the n-grams as it indexes the dictionary.

To create the spell index, you can instantiate a new DidYouMeanIndexer and invoke the createSpellIndex() method from your code. Alternatively, you can run DidYouMeanIndexer from the command line (the main() method is not shown in the above listing).

The “Did You Mean” Search Engine

Next, let’s turn back to our SearchEngine interface and look at the implementation of DidYouMeanSearchEngine. This implementation looks for query suggestions when the search results have low relevance.


package org.tiling.didyoumean;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class DidYouMeanSearchEngine implements SearchEngine {

    private String defaultField;
    private String nameField;
    private Directory originalIndexDirectory;
    private int maxHits;
    private int minimumHits;
    private float minimumScore;
    private DidYouMeanParser didYouMeanParser;

    public DidYouMeanSearchEngine(String defaultField, String nameField,
            Directory originalIndexDirectory,
            int maxHits, int minimumHits, float minimumScore,
            DidYouMeanParser didYouMeanParser) {

        this.defaultField = defaultField;
        this.nameField = nameField;
        this.originalIndexDirectory = originalIndexDirectory;
        this.maxHits = maxHits;
        this.minimumHits = minimumHits;
        this.minimumScore = minimumScore;
        this.didYouMeanParser = didYouMeanParser;
    }

    public SearchResult search(String queryString) throws IOException, ParseException {
        long startTime = System.currentTimeMillis();
        IndexSearcher is = null;
        try {
            is = new IndexSearcher(originalIndexDirectory);
            Query query = didYouMeanParser.parse(queryString);
            Hits hits = is.search(query);

            String suggestedQueryString = null;
            // Only look at the top score when there is at least one hit.
            if (hits.length() < minimumHits || hits.length() == 0 || hits.score(0) < minimumScore) {
                Query didYouMean = didYouMeanParser.suggest(queryString);
                if (didYouMean != null) {
                    suggestedQueryString = didYouMean.toString(defaultField);
                }
            }

            long endTime = System.currentTimeMillis();
            return new SearchResult(extractHits(hits), hits.length(),
                    endTime - startTime, queryString, suggestedQueryString);
        } finally {
            if (is != null) {
                is.close();
            }
        }
    }

    private List extractHits(Hits hits) throws IOException {
        List hitList = new ArrayList();
        for (int i = 0, count = 0; i < hits.length() && count++ < maxHits; i++) {
            hitList.add(hits.doc(i).getField(nameField).stringValue());
        }
        return hitList;
    }

}

  
The “Did You Mean” Parser

The key difference between the DidYouMeanSearchEngine class and the SimpleSearchEngine class is the introduction of the DidYouMeanParser interface. The DidYouMeanParser interface encapsulates a strategy both for parsing query strings and for suggesting spelling corrections for query strings:


package org.tiling.didyoumean;

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;

public interface DidYouMeanParser {
    public Query parse(String queryString) throws ParseException;
    public Query suggest(String queryString) throws ParseException;
}

  

The DidYouMeanSearchEngine only asks the DidYouMeanParser for a suggested query if the number of hits returned falls below a minimum threshold (the minimumHits property), or if the relevance of the top hit falls below a minimum threshold (the minimumScore property). Of course, you may choose to implement your own criteria for when to make a “did you mean” suggestion, but this rule is simple and effective.

The first implementation of DidYouMeanParser is straightforward:


package org.tiling.didyoumean;

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;

public class SimpleDidYouMeanParser implements DidYouMeanParser {

    private String defaultField;
    private Directory spellIndexDirectory;

    public SimpleDidYouMeanParser(String defaultField, Directory spellIndexDirectory) {
        this.defaultField = defaultField;
        this.spellIndexDirectory = spellIndexDirectory;
    }

    public Query parse(String queryString) {
        return new TermQuery(new Term(defaultField, queryString));
    }

    public Query suggest(String queryString) throws ParseException {
        try {
            SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
            if (spellChecker.exist(queryString)) {
                return null;
            }
            String[] similarWords = spellChecker.suggestSimilar(queryString, 1);
            if (similarWords.length == 0) {
                return null;
            }
            return new TermQuery(new Term(defaultField, similarWords[0]));
        } catch (IOException e) {
            throw new ParseException(e.getMessage());
        }
    }

}

  

The parse() method simply constructs a new TermQuery from the query. (This means that SimpleDidYouMeanParser only works with single-word queries, a deficiency we shall remedy later.) The suggest() implementation is more interesting. Just as when we created the spell index earlier, we construct a new SpellChecker with the index location for the spell index. This time, however, we just read from the index. First we check if the query word is in the index–if it is, we assume that it is correctly spelled, and make no suggestion by returning null. If instead the query word is not in the index, then we ask the spell checker for a single suggestion, by invoking the suggestSimilar() method. Of course, it may happen that no words are similar enough to the input, so we return null again. But if a suggestion is found, then it is returned as a new TermQuery.

Whew! Let’s see it in action after everything has been wired up using Spring. Figure 2 is a screenshot for the misspelled query “lettice.”

Figure 2. Suggesting a sensible alternative query

How It Works

There’s a lot going on in the suggestSimilar() method of SpellChecker, so let’s follow it through with an example. Take the correctly spelled word “lettuce,” which appears in the Beatrix Potter texts I’ve used for this article. In the original index, where each Lucene document corresponds to a text, “lettuce” appears in two Lucene documents in the contents field. On the other hand, the spell index contains a whole Lucene document for every distinct word in the original index. Each document has a number of fields, as shown here with the values for the document representing the word “lettuce.”

Field name    Field values
word          lettuce
start3        let
gram3         let ett ttu tuc uce
end3          uce
start4        lett
gram4         lett ettu ttuc tuce
end4          tuce

Notice how both trigrams and 4-grams are indexed. In fact, precisely which n-grams are indexed depends on the size of the word. For very short words, unigrams and bigrams are indexed, whereas for longer words, trigrams and 4-grams are indexed.
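The layout of such a spell-index document can be sketched in a few lines of plain Java. This is not the Lucene Spell Checker source: the class name `SpellDocument` is invented, a simple map stands in for a Lucene document, and the n-gram sizes are passed in as parameters because the exact word-length thresholds are not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: the fields stored for one word in the spell index.
class SpellDocument {
    // Builds a field name -> value map for n-gram sizes minN..maxN.
    static Map<String, String> fieldsFor(String word, int minN, int maxN) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("word", word);
        for (int n = minN; n <= maxN; n++) {
            if (word.length() < n) {
                continue; // the word is too short to have any n-grams of this size
            }
            StringJoiner grams = new StringJoiner(" ");
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
            fields.put("start" + n, word.substring(0, n));
            fields.put("gram" + n, grams.toString());
            fields.put("end" + n, word.substring(word.length() - n));
        }
        return fields;
    }

    public static void main(String[] args) {
        fieldsFor("lettuce", 3, 4).forEach((name, value) -> System.out.println(name + "\t" + value));
    }
}
```

Calling it with sizes 3 and 4 for “lettuce” gives the same field names as the table above.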

The suggestSimilar() method forms a Lucene query to search the spell index for candidate suggestions. For the misspelling “lettice” the query is as follows (split over two lines to make it easier to read):


start3:let^2.0 end3:ice gram3:let gram3:ett gram3:tti gram3:tic gram3:ice
start4:lett^2.0 end4:tice gram4:lett gram4:etti gram4:ttic gram4:tice
  

The start n-grams are given more weight than the other n-grams in the word; here, they are boosted by a factor of two, signified by the ^2.0 notation. Another reason to index the start and end n-grams separately is that they are positional, unlike the other n-grams. For example, the words “eat” and “ate” have the same set of unigrams (gram1:e gram1:a gram1:t) and share the bigram gram2:at, so the start and end fields help to distinguish them (start1:e end1:t start2:ea end2:at for “eat,” and start1:a end1:e start2:at end2:te for “ate”).
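To show how a query of this shape can be assembled, here is a sketch that builds the query text as a string. It only mimics the output format shown above; the real SpellChecker constructs Lucene query objects internally, and the class name, method name and parameters here are assumptions made for illustration.

```java
// Hypothetical sketch: build the textual form of the spell-index query.
class SpellQuery {
    // One line per n-gram size: boosted start field, end field, then every gram.
    static String build(String word, int minN, int maxN, float startBoost) {
        StringBuilder q = new StringBuilder();
        for (int n = minN; n <= maxN; n++) {
            if (word.length() < n) {
                continue; // no n-grams of this size
            }
            if (q.length() > 0) {
                q.append('\n');
            }
            q.append("start").append(n).append(':').append(word, 0, n)
             .append('^').append(startBoost);
            q.append(" end").append(n).append(':').append(word.substring(word.length() - n));
            for (int i = 0; i + n <= word.length(); i++) {
                q.append(" gram").append(n).append(':').append(word, i, i + n);
            }
        }
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("lettice", 3, 4, 2.0f));
    }
}
```

Because every clause is optional, a dictionary word scores higher the more of these n-grams it contains, with a match on the start field counting double.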

Using a Lucene index browser, such as the excellent Luke–the Lucene Index Toolbox, we can manually run this query against the spell index. Figure 3 shows what we get.

Figure 3. Browsing the spell index

But the top hit is “letting,” not “lettuce,” which the web app presented us with. What’s going on? The answer is that the Lucene Spell Checker ranks suggestions by edit distance, not by search relevance. The string “lettice” differs from “lettuce” by a single substitution, whereas “letting” is two substitutions away.

Supporting Composite Queries

SimpleSearchEngine supports composite queries–that is, queries that are composed of a set of clauses; for example, lettuce parsley, which means “find documents in which both of the words ‘lettuce’ and ‘parsley’ appear.” As noted above, DidYouMeanSearchEngine with SimpleDidYouMeanParser only supports single-word queries, so let’s see how we can fix it to support composite queries.

CompositeDidYouMeanParser is an implementation of DidYouMeanParser for use by DidYouMeanSearchEngine that supports composite queries. Recall that the DidYouMeanParser interface has a parse() method and a suggest() method, both of which take query strings and return Lucene Query objects. The implementation of parse() is simple: it uses Lucene’s QueryParser, which has built-in support for composite queries. The implementation of suggest() is a little more tricky. It relies on the getFieldQuery() extensibility hook provided by QueryParser, so if a term (or a word in a phrase) is misspelled, then it is replaced with the best suggestion. If no terms (or words in a phrase) in the whole query are misspelled, then suggest() returns null.

Figure 4 is a screenshot for the misspelled composite query “lettice parslee.”

Figure 4. Correcting the spelling of multiple query terms

Ensuring High-Quality Suggestions

Having a clever algorithm for detecting and correcting spelling errors is a good start, but you need a good source of correctly spelled words to ensure the suggestions are of a high quality. So far, we have used the terms in the original index as the source of words (by constructing a LuceneDictionary). There is a downside to this approach: the content that was indexed will almost certainly contain spelling errors, so there is a good chance that certain query suggestions will be misspelled.

You might think that using a compiled word list might help. However, even the largest dictionaries fall short in word coverage for proper nouns and newly coined words (e.g., technical phrases), so a correctly spelled query term that is not in the dictionary will be incorrectly marked as a misspelling. The user would then be prompted with a distracting alternative query suggestion. (As a side note, Lucene Spell Checker provides an implementation of Dictionary, PlainTextDictionary, which can read words from a word list such as /usr/dict/words commonly found on Unix systems. Use this to do regular spell checking against a dictionary.)

Lucene Spell Checker provides a mechanism to solve this problem, while still using the original index as the source of words. The suggestSimilar() method of SpellChecker is overloaded to support secondary sorting of the suggested words by document frequency in an index; for example:


spellChecker.suggestSimilar(queryText, 1, originalIndexReader, defaultField, true);
  

This call restricts suggestions to those words that are more popular (true) in the original index than the query term. On the plausible assumption that across the whole set of documents, misspellings are less common than the correctly spelled instances of the word, this modification will improve the quality of suggestions, even in document collections containing misspellings.
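The effect of that flag can be pictured with a small stand-alone sketch: drop candidates that are no more frequent than the query term, then order the rest by edit distance and, within the same distance, by document frequency. This is an approximation of the behaviour described here, not the library's code; the `Candidate` type, the method name and all the numbers in `main` are invented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rank suggestions by edit distance, then by popularity.
class PopularityRanking {
    // A candidate word with a precomputed edit distance and document frequency.
    static final class Candidate {
        final String word;
        final int editDistance;
        final int docFreq;

        Candidate(String word, int editDistance, int docFreq) {
            this.word = word;
            this.editDistance = editDistance;
            this.docFreq = docFreq;
        }
    }

    // Keeps candidates more frequent than the query term and returns the best max words.
    static List<String> rank(List<Candidate> candidates, int queryDocFreq, int max) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.docFreq > queryDocFreq) {
                kept.add(c);
            }
        }
        // Closest first; among equally close words, the more frequent one wins.
        kept.sort(Comparator.comparingInt((Candidate c) -> c.editDistance)
                .thenComparing(Comparator.comparingInt((Candidate c) -> c.docFreq).reversed()));
        List<String> words = new ArrayList<>();
        for (int i = 0; i < kept.size() && i < max; i++) {
            words.add(kept.get(i).word);
        }
        return words;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("letting", 2, 5),   // frequencies here are made up
                new Candidate("lettuce", 1, 2),
                new Candidate("lettuse", 1, 0));
        System.out.println(rank(candidates, 0, 1));
    }
}
```

A misspelling that slipped into the indexed content typically has a very low document frequency, so the secondary sort tends to push it below the correctly spelled word.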

Zeitgeist

Large search engines use user queries for the source of suggestions. The logic is: if you don’t understand what a user is asking for, compare it to what other users ask for, as someone else is likely to have searched for something similar.

To implement this strategy, each user query submitted to the system should be indexed in the spell index in order to provide a proper record of query frequencies. (All of the main search engines publish their most popular search terms, which are ultimately derived from such an index.) Then, by using the overloaded suggestSimilar() method introduced in the previous section, suggestions will be ranked firstly by edit distance and secondly by user popularity.

Conclusion

Spell checking users’ search queries is a nice feature, and relatively easy to add to a Lucene-powered search application, as this article has shown. Most of the time, the corrections suggested are good ones, but there is plenty of ongoing research in the information retrieval community on improving spell check algorithms (see “References,” below). I think we will continue to see the fruits of such research in open source libraries like Lucene Spell Checker.

References

Tom White is lead Java developer at Kizoom, a leading U.K. software company in the delivery of personalized travel information.

Posted by micas on Aug 8th 2007 | Filed in SEO | Comments (0)

SpellChecker Java Search API

 

July 9, 2007 at 9:20 pm | In Java

When writing a program, you basically always have to check or correct bad input. This is about the simplest way to tell whether a program's author has been lazy or cut corners.

Any introductory computing class is bound to mention GIGO (garbage in, garbage out). It means that computers are dumb: when the input data is garbage, the output can only be garbage. Computers make mistakes, but people make them more easily, and make more of them.

The most basic requirement is to react immediately and point out the error at input time. The simplest check is on data type and blank values (class 1), and any half-decent program also has pattern checks for numbers, times, dates, sizes and specific formats such as email or UUID (class 2). More carefully built programs go a step further and check that the data is valid: for example, whether a year, month and day make a sensible date, whether a word is misspelled, whether a value is a duplicate, whether a matching ID can be found in the database, and so on (class 3). The best ones concentrate on the interaction with the person: clear error messages, a response within the time the user stays focused (about three seconds, I believe), and automatic corrections and suggestions (class 4).

Examples

Class 1: Yahoo Dictionary
Although it says "Please enter a word to look up, Chinese or English", Japanese, French and German get through as well. And if you enter Simplified Chinese characters or Japanese kanji, it does not understand them either.

Class 2: the "E-mail this page" form often seen on web pages (I have always wondered who uses it). For example, the footer of this page.
It asks you to enter an email address in a certain format, but it never knows whether that address is actually correct.

Class 3: the signup form of present-day forums.
It checks the email address you entered by sending a validation code to it.

Class 4: Google Suggest / Gmail
They can offer suggestions, corrections and even auto-complete before you press submit.

The only reason the examples above are all web applications is that anyone can see them. In fact there are even more examples among client applications.
For example: Notepad vs UltraEdit vs Microsoft Word vs Eclipse IDE.

I seem to be drifting further and further away. Back to the point: the most troublesome part, and the hardest to implement, is automated correction suggestions. Since the goal is to avoid GIGO, doesn't that contradict what was said above about computers being dumb? There is no contradiction: the suggestions are built on the part of the data that is correct.

Back to the main topic again: English spelling correction works from the part of the word that is spelled correctly. (It can also be approached from the linguistic side, but that is work humans are good at, and it is outside the scope of this post.)
English words are made of letters, and the order and manner in which the letters are combined can point to the right suggestion.

An example may make this easier to understand: take Monstor (a misspelling of Monster). Broken down into n-grams, it gives mon, ons, nst, sto, tor. Compared with Monster (mon, ons, nst, ste, ter), 3/5 * 3/5 of them match (there are two ratios because the comparison runs in both directions). Compared with "monsters inc", only 3/5 * 3/8 match. And compared with apple, nothing matches at all. As long as an index of the words has been built in advance, this method quickly turns up a few similar words.
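The comparison in the Monstor example can be written down as a short sketch. It is only an illustration of the two-way ratio described above: the class name is made up, trigrams are compared as sets after lower-casing, and nothing here is taken from Lucene.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: two-way trigram overlap between two words.
class TrigramSimilarity {
    // Returns the distinct lower-cased trigrams of a word.
    static Set<String> trigrams(String word) {
        Set<String> grams = new HashSet<>();
        String w = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= w.length(); i++) {
            grams.add(w.substring(i, i + 3));
        }
        return grams;
    }

    // Share of a's trigrams found in b, multiplied by the share of b's trigrams found in a.
    static double score(String a, String b) {
        Set<String> ga = trigrams(a);
        Set<String> gb = trigrams(b);
        if (ga.isEmpty() || gb.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return ((double) common.size() / ga.size()) * ((double) common.size() / gb.size());
    }

    public static void main(String[] args) {
        System.out.println(score("Monstor", "Monster"));
        System.out.println(score("Monstor", "apple"));
    }
}
```

Multiplying the two ratios penalises long candidates that merely contain the misspelled word's trigrams, which is why a longer phrase scores lower than the single intended word.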

In general, 3-grams and 4-grams are enough to give decent results for English. Naturally, components like this were written long ago. If you are using Lucene 1.4, you can write it yourself with a little effort, and if you are using Lucene 2.0, there is already a ready-made library.

Reference: SpellChecker - Lucene-java Wiki
Reference: Lucene 2.0 org.apache.lucene.search.spell

Even without Lucene, you can write something similar yourself with PHP and MySQL: just build an index table along the lines described above and design a single inner join SQL statement.

9 July 2007
Dennis

Posted by micas on Aug 8th 2007 | Filed in SEO | Comments (0)