spacedotworks

Processing Japanese text

Processing Japanese (or, more generally, CJK) text involves several distinct aspects:

1. Optical character recognition

Because Japanese text is not "spelled" out of a small alphabet as in most Western languages, identifying a character or word in an image is not a simple task. Libraries available for this task include:

  • Tesseract + Leptonica libraries, which include training functions for learning new languages
    • Box files are necessary for training a language file; jTessBoxEditor is a good cross-platform editor for them
  • OpenCV
    • While a Tesseract installation comes with the image-processing library Leptonica, OpenCV is a widely used pre-processing tool for both video and image recognition / feature extraction
    • Useful for identifying borders and lines
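
As a rough illustration of the OCR step, here is a minimal Python sketch that drives the Tesseract command-line tool. It assumes that the `tesseract` binary (version 4 or later, which uses the `--psm` spelling) and its `jpn` language data are installed. The function names and the image path in the usage note are made up for this example.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str, lang: str = "jpn", psm: int = 6) -> List[str]:
    """Build a Tesseract CLI invocation that prints the recognized text to stdout."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    # --psm 6 asks tesseract to assume a single uniform block of text.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path: str, lang: str = "jpn") -> Optional[str]:
    """Run Tesseract on an image; return None if the binary is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,
    )
    # Assumption: tesseract emits UTF-8 text.
    return result.stdout.decode("utf-8")
```

Calling `ocr_image("page.png")` on a scanned page (a hypothetical file) would return the recognized text, or `None` on a machine without Tesseract. Any cleaning of the image with OpenCV would happen before this call.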

2. Breaking down text into morphological bits

This is another non-trivial task, particularly with Japanese. Japanese is written without spaces between words, and the presence of okurigana and multiple ways of writing the same word means that some intelligence is needed to separate the words within a sentence. This is particularly important if you want to perform text glossing.

  • MeCab is a morphological analyzer that does exactly this
    • Available for many platforms and can also be built as a command-line tool
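
To give an idea of what the segmentation result looks like, the sketch below parses MeCab's default text output, which has one token per line (surface form, a tab, then comma-separated features) and ends each sentence with `EOS`. It assumes the `mecab` command is on the PATH with a UTF-8 dictionary. The function names are made up for this example, and `SAMPLE` is hand-written in the style of the IPA dictionary rather than captured from a real run.

```python
import shutil
import subprocess
from typing import List, Optional, Tuple

# (surface form, list of feature fields)
Token = Tuple[str, List[str]]


def parse_mecab_output(output: str) -> List[Token]:
    """Parse MeCab's default output: 'surface<TAB>feature,feature,...' per line."""
    tokens: List[Token] = []
    for line in output.splitlines():
        # "EOS" marks the end of a sentence; skip it and any blank lines.
        if line == "EOS" or not line.strip():
            continue
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens


def tokenize(text: str) -> Optional[List[Token]]:
    """Pipe text through the mecab command; return None if it is not installed."""
    if shutil.which("mecab") is None:
        return None
    result = subprocess.run(
        ["mecab"], input=text.encode("utf-8"), capture_output=True, check=True
    )
    return parse_mecab_output(result.stdout.decode("utf-8"))


# Hand-written sample in IPA-dictionary style (an assumption, not real output).
SAMPLE = (
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)
```

With this layout, `parse_mecab_output(SAMPLE)` splits 食べた into 食べ and た, and the seventh feature field of the first token holds the dictionary form 食べる. That field is the one you would look up when glossing.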

3. Dictionaries

Finally, something that is more easily obtainable. Many projects, including the Tanaka Corpus, the WWWJDIC project and Tatoeba, provide raw files, API access and more for dictionary lookups and translation.
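
As a small example of working with a raw dictionary file, the sketch below parses lines in the EDICT-style flat layout (`word [reading] /gloss/gloss/`), which is the layout assumed here for the WWWJDIC dictionary files. The names `Entry`, `parse_edict_line` and `build_index` are made up for this example, and the entries in the usage note are hand-written, not copied from a real file.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional


class Entry(NamedTuple):
    word: str
    reading: str
    glosses: List[str]


# word, optional " [reading]", then "/gloss/gloss/.../"
_EDICT_LINE = re.compile(
    r"^(?P<word>\S+)(?: \[(?P<reading>[^\]]+)\])? /(?P<glosses>.*)/$"
)


def parse_edict_line(line: str) -> Optional[Entry]:
    """Parse one EDICT-style line; return None if it does not match the layout."""
    match = _EDICT_LINE.match(line.strip())
    if match is None:
        return None
    word = match.group("word")
    # Kana-only entries have no bracketed reading; the word is its own reading.
    reading = match.group("reading") or word
    glosses = [g for g in match.group("glosses").split("/") if g]
    return Entry(word, reading, glosses)


def build_index(lines: Iterable[str]) -> Dict[str, Entry]:
    """Index parsed entries by headword for simple exact-match lookups."""
    index: Dict[str, Entry] = {}
    for line in lines:
        entry = parse_edict_line(line)
        if entry is not None:
            index[entry.word] = entry
    return index
```

For instance, `build_index(["猫 [ねこ] /(n) cat/"])["猫"].glosses` gives `["(n) cat"]`. Combined with the dictionary forms produced by a morphological analyzer, an index like this is enough for basic glossing.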

4. Japanese text

Text encoding is one aspect that frequently requires work when displaying Japanese text. Apart from using a compatible HTML character encoding, SQL tables and the parsing functions of the programming or scripting language in use also have to be configured to handle Japanese text.
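
The encoding problem is easy to demonstrate in Python. The sketch below converts text from a legacy Japanese encoding (Shift_JIS or EUC-JP) to UTF-8 and checks whether a byte string is valid UTF-8. The function names are made up for this example, and it assumes you already know, or have been told, the source encoding.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "日本語"
legacy_bytes = text.encode("shift_jis")  # what an older page or database might hold
utf8_bytes = transcode_to_utf8(legacy_bytes, "shift_jis")
```

Here `legacy_bytes` is not valid UTF-8, so reading it with the wrong codec either fails or produces mojibake. The same mismatch can occur at every layer (HTML, database, application code), which is why it helps to settle on one encoding throughout.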

One interesting way to display Japanese text is via the Ruby HTML tags (the annotation markup, not the programming language) that were introduced in HTML5. At the time of writing, all browsers apart from Opera and Firefox can display Ruby tags. Ruby tags place small annotations above Japanese characters; annotations that give the reading in kana are known as furigana. Alternatively, romaji can be used instead of hiragana to help users who are new to Japanese.
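
A minimal sketch of generating such markup is shown below. The helper name `ruby` is made up for this example. It wraps a base text and its reading in `<ruby>` and `<rt>`, and adds `<rp>` fallback parentheses so that browsers without Ruby support show the reading inline in brackets.

```python
from html import escape


def ruby(base: str, reading: str) -> str:
    """Wrap base text and its reading (kana or romaji) in HTML ruby markup."""
    return (
        f"<ruby>{escape(base)}"
        # <rp> content is only shown by browsers that do not render ruby.
        f"<rp>(</rp><rt>{escape(reading)}</rt><rp>)</rp>"
        "</ruby>"
    )


furigana_html = ruby("漢字", "かんじ")
romaji_html = ruby("漢字", "kanji")
```

The first call yields `<ruby>漢字<rp>(</rp><rt>かんじ</rt><rp>)</rp></ruby>`. Passing romaji as the second argument gives the beginner-friendly variant.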
