spacedotworks

Useful PHP tricks when dealing with Japanese text

1. Declare a UTF-8 locale before reading Japanese text from system commands (via exec or shell_exec)

putenv('LANG=en_US.UTF-8');

exec($command, $output); // $command is a placeholder for your shell command; $output receives its lines
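The two calls above can be wrapped in a small helper. This is only a minimal sketch: the function name `runWithUtf8Locale` is hypothetical, and it assumes the `en_US.UTF-8` locale is installed on the host and that the command itself emits UTF-8.

```php
<?php
// Hypothetical helper (not part of the original tip): runs a shell command
// after exporting a UTF-8 locale and returns its output lines.
function runWithUtf8Locale(string $command): array
{
    // Child processes started by exec() inherit this environment variable.
    putenv('LANG=en_US.UTF-8');

    $output = [];
    $exitCode = 0;
    exec($command, $output, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed with exit code {$exitCode}");
    }

    return $output; // one array element per line of output
}
```

For example, `runWithUtf8Locale('ls')` would return the file names in the current directory, with any Japanese names left intact as long as the file system stores them as UTF-8.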

2. Check if a string contains a Japanese character.

preg_match('/[\x{4E00}-\x{9FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $key);
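As a rough sketch, the check can be wrapped in a function that returns a boolean. The name `containsJapanese` is an assumption for illustration. Note that the kanji range (CJK Unified Ideographs) is shared with Chinese, so a match on kanji alone does not prove the text is Japanese.

```php
<?php
// Hypothetical wrapper around the pattern above.
// Ranges: CJK Unified Ideographs (kanji), Hiragana, Katakana.
function containsJapanese(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns 1 on a match, 0 on no match, and false on
    // error (for example, invalid UTF-8), so compare strictly against 1.
    return preg_match(
        '/[\x{4E00}-\x{9FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u',
        $text
    ) === 1;
}
```

An alternative that may read more clearly is to use Unicode script properties, e.g. `/[\p{Han}\p{Hiragana}\p{Katakana}]/u`, which also covers kanji outside the basic block.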

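3. Set the connection encoding before running SQL queries in PHP

$con = mysqli_connect($config['host'], $config['username'], $config['password'], $config['dbname']);

// utf8mb4 covers 4-byte characters (rare kanji, emoji) that MySQL's 3-byte "utf8" cannot store
mysqli_set_charset($con, "utf8mb4");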
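A slightly more defensive version of the same steps might look like the sketch below. The function name `connectUtf8` is hypothetical, the `$config` keys are the ones used above, and the choice of `utf8mb4` assumes MySQL 5.5.3 or later.

```php
<?php
// Hypothetical helper: opens a mysqli connection and sets its character
// set before any query is sent.
function connectUtf8(array $config): mysqli
{
    // Make mysqli throw exceptions instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $con = mysqli_connect(
        $config['host'],
        $config['username'],
        $config['password'],
        $config['dbname']
    );

    // mysqli_set_charset() returns false if the charset cannot be applied.
    if (!mysqli_set_charset($con, 'utf8mb4')) {
        throw new RuntimeException('Could not switch the connection to utf8mb4');
    }

    return $con;
}

// Usage (requires a reachable MySQL server):
// $con = connectUtf8($config);
// echo mysqli_character_set_name($con);
```

Setting the charset through `mysqli_set_charset` rather than a `SET NAMES` query also keeps `mysqli_real_escape_string` aware of the encoding in use.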

Processing Japanese text

For processing of Japanese (or CJK) text, there are various aspects to consider:

1. Optical character recognition

Because these scripts are not "spelled" out letter by letter as in most Western languages, identifying a character or word is not a simple task. Libraries available for this task include:

  • Tesseract + Leptonica libraries - which include training functions for learning new languages
    • Box files are necessary for training a language file, and jTessBoxEditor is a good cross-platform editor for them
  • OpenCV
    • While a Tesseract installation comes with the image-processing library Leptonica, OpenCV is a widely used pre-processing tool for both video and image recognition / feature extraction
    • Useful for identifying borders and lines
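To tie this back to the PHP tips, here is a minimal sketch of calling the Tesseract command-line tool from PHP. It assumes the `tesseract` binary is on the PATH and that the Japanese language data (`jpn`) is installed; the function names `buildTesseractCommand` and `recognizeText` are hypothetical.

```php
<?php
// Hypothetical helper: builds the shell command for a Tesseract run that
// writes the recognized text to standard output.
function buildTesseractCommand(string $imagePath, string $language = 'jpn'): string
{
    // escapeshellarg() quotes each argument so spaces cannot break the command.
    return sprintf(
        'tesseract %s stdout -l %s',
        escapeshellarg($imagePath),
        escapeshellarg($language)
    );
}

// Hypothetical helper: runs Tesseract and returns the recognized text.
function recognizeText(string $imagePath, string $language = 'jpn'): string
{
    // Export a UTF-8 locale so the Japanese output is not mangled.
    putenv('LANG=en_US.UTF-8');

    $lines = [];
    $exitCode = 0;
    exec(buildTesseractCommand($imagePath, $language), $lines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("tesseract exited with code {$exitCode}");
    }

    return implode("\n", $lines);
}
```

Recognition quality usually depends heavily on pre-processing (binarization, deskewing, removing borders and lines), which is where a tool such as OpenCV comes in before the image is handed to Tesseract.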


Flexible Layout

Our website has a flexible layout and is mobile-friendly too! Try resizing your browser window to see the elastic design!

Beta version of Shizen!

Real-time Earthquake Alert is a page that collects 'live' earthquake reports from the Japan Meteorological Agency and P2P sources. The data is also translated into English on the fly. The reports are collected within seconds of their creation.

 
