Java port of Mozilla's Automatic Charset Detection

What is jchardet ?
How do I build these libraries ?
How do I play around with it ? I want to test some web pages.
How will I integrate this code with my project ?
Show me a sample implementation.
How does the algorithm work ?
What problem does jchardet address ?
Any bugs reported ?

What is jchardet ?

jchardet is a Java port of Mozilla's automatic charset detection algorithm, originally written by Frank Tang. The original C++ source can be found at http://lxr.mozilla.org/mozilla/source/intl/chardet/ and more information is available at http://www.mozilla.org/projects/intl/chardet.html

How do I build these libraries ?

There is a build.xml in the root directory. If you have Apache Ant installed, just type "ant".
Note: A prebuilt chardet.jar is already supplied under dist/lib/chardet.jar, in case you don't want to compile.

How do I play around with it ? I want to test some web pages.

There is a sample implementation called HtmlCharsetDetector that is supplied with the package. This class fetches the given HTML page, passes it to the AutoDetect engine, and outputs the detected charset.
To run the sample, pass the URL of the page to test as the first argument (shown as URL below):

cd dist/lib
java -classpath chardet.jar org.mozilla.intl.chardet.HtmlCharsetDetector URL

How will I integrate this code with my project ?

The procedure is simple...

First, implement the interface nsICharsetDetectionObserver in the class that should be notified of the detected charset. The interface requires just one method, Notify(). The engine calls this method and passes it the final result whenever it positively identifies a charset.

package org.mozilla.intl.chardet ;

import java.lang.* ;

public interface nsICharsetDetectionObserver {

        public void Notify(String charset) ;
}

Second, create an nsDetector and register your observer with Init(). If you find a non-ASCII character in your stream, start feeding data to the DoIt() member function.

Finally, once you are done with the input stream, call DataEnd(). By this time the engine should have notified the observer of the detected charset. See src/HtmlCharsetDetector.java for a sample implementation.

Show me a sample implementation.

Code from HtmlCharsetDetector.java
        // Initialize the nsDetector() ;
        int lang = (argv.length == 2)? Integer.parseInt(argv[1])
                                         : nsPSMDetector.ALL ;
        nsDetector det = new nsDetector(lang) ;

        // Set an observer...
        // The Notify() will be called when a matching charset is found.

        det.Init(new nsICharsetDetectionObserver() {
                public void Notify(String charset) {
                    HtmlCharsetDetector.found = true ;
                    System.out.println("CHARSET = " + charset);
                }
        });

        URL url = new URL(argv[0]);
        BufferedInputStream imp = new BufferedInputStream(url.openStream());

        byte[] buf = new byte[1024] ;
        int len;
        boolean done = false ;
        boolean isAscii = true ;

        while( (len=imp.read(buf,0,buf.length)) != -1) {

                // Check if the stream is only ascii.
                if (isAscii)
                    isAscii = det.isAscii(buf,len);

                // DoIt if non-ascii and not done yet.
                if (!isAscii && !done)
                    done = det.DoIt(buf,len, false);
        }
        det.DataEnd();

        if (isAscii) {
           System.out.println("CHARSET = ASCII");
           found = true ;
        }

How does the algorithm work ?

Browsers handle this problem by looking at the data byte by byte and trying to guess the charset (this is what happens when you click View->Auto-Select or Auto-Detect in the menu). The algorithm (originally developed by Frank Tang) examines the byte sequence and, based on the value of each byte, uses elimination logic to narrow the candidates down to a final charset. If there is a tie between EUC charsets, a second logic breaks the tie using frequency statistics of the characters in a given language.
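
The elimination idea can be illustrated with a small stand-alone sketch. This is not the jchardet code: the real engine uses its own per-charset verifiers, whereas this sketch assumes the JDK's strict charset decoders as a stand-in. The class name EliminationSketch and the method survivors() are hypothetical names used only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of elimination logic; not part of jchardet.
public class EliminationSketch {

    // Returns the candidates whose strict decoder accepts every byte sequence in data.
    static List<Charset> survivors(byte[] data, List<Charset> candidates) {
        List<Charset> remaining = new ArrayList<>();
        for (Charset cs : candidates) {
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                remaining.add(cs); // no illegal sequence seen, keep the candidate
            } catch (CharacterCodingException e) {
                // illegal byte sequence for this charset, so it is eliminated
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.US_ASCII,
                StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1);

        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = "\u00e9t\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(survivors(utf8Bytes, candidates));   // [UTF-8, ISO-8859-1]
        System.out.println(survivors(latin1Bytes, candidates)); // [ISO-8859-1]
    }
}
```

The first result leaves two survivors. Elimination alone cannot always pick a winner, which is the kind of tie the frequency statistics are meant to break.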

What problem does jchardet address ?

Problem Statement:
The Java String (and char) classes store data as Unicode values. When handling international text from an outside source, we need to provide information about the encoding of the text so that it is converted to the correct Unicode values. This means you have to know the encoding of all the text that your Java code handles. Many Internet-based Java applications have to deal with data from arbitrary sources, and the encoding is not always explicitly known. For example, if an HTML page has no meta tag explicitly specifying its charset, it is very hard to determine the encoding, and the conversion to a Java Unicode string will end up corrupting the data.
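
As a small illustration of the corruption described above, the sketch below decodes the same bytes once with the right charset and once with a wrong guess. The class name WrongCharsetDemo and the helper decode() are hypothetical, and the choice of UTF-8 and ISO-8859-1 is only an assumption made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo; it only uses charsets that every Java runtime provides.
public class WrongCharsetDemo {

    // Converts raw bytes to a Java string under the assumed encoding.
    static String decode(byte[] data, Charset assumed) {
        return new String(data, assumed);
    }

    public static void main(String[] args) {
        String original = "\u65e5\u672c\u8a9e"; // three Japanese characters
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // bytes as received

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 9, one char per byte
    }
}
```

The wrong guess raises no error. It silently produces nine unrelated characters, which is why the encoding has to be known or detected before the conversion.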

Any bugs reported ?

Brian Guan writes...
>
> Hi,
>
> I've just downloaded jchardet and realized it had a bug when
> I tried out this site:
>
> http://java.sun.com/j2se/1.4/ja/docs/ja/index.html
>
> HtmlCharsetDetector incorrectly report that this site is ASCII.
>
> So I browsed and adjusted the code a little to:
>
> ...
>
>       // boolean isAscii = true ;
>       boolean isAscii = false ;
>
>       while((!done) && ((len=imp.read(buf,0,buf.length)) != -1)) {
>
>               // Check if the stream is only ascii.
>               // if (isAscii)
>                   isAscii = det.isAscii(buf,len);
>
>               // DoIt if non-ascii and not done yet.
>               // if (!isAscii && !done)
>                   done = det.DoIt(buf,len, false);
>       }
>       det.DataEnd();
>
> ...
>
> And it now reports that the probable encoding is:
>
> CHARSET = ISO-2022-JP
> CHARSET = ASCII
>
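
A likely explanation for this report is that ISO-2022-JP is a 7-bit encoding. It switches character sets with escape sequences and never sets the high bit of a byte, so a check that only looks for bytes above 0x7F cannot tell it apart from ASCII. The sketch below assumes that the ASCII check works this way. The class SevenBitCheck and its method isSevenBit() are hypothetical and are not the library's isAscii() implementation, and the sample bytes are hand-written for illustration.

```java
// Hypothetical illustration; not part of jchardet.
public class SevenBitCheck {

    // Returns true when none of the first len bytes has the high bit set.
    static boolean isSevenBit(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if ((buf[i] & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Japanese text in ISO-2022-JP: ESC $ B, three two-byte codes, ESC ( B.
        byte[] iso2022jp = {
            0x1B, '$', 'B', 'F', '|', 'K', '\\', '8', 'l', 0x1B, '(', 'B'
        };
        // The same text in EUC-JP, where every byte has the high bit set.
        byte[] eucjp = {
            (byte) 0xC6, (byte) 0xFC, (byte) 0xCB, (byte) 0xDC, (byte) 0xB8, (byte) 0xEC
        };

        System.out.println(isSevenBit(iso2022jp, iso2022jp.length)); // true
        System.out.println(isSevenBit(eucjp, eucjp.length));         // false
    }
}
```

Under that assumption, a stream in ISO-2022-JP never reaches DoIt() in the original loop, which matches the behaviour described in the report.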