NLP for Nepali Text – Console Printing & DB Handling
Nepali is written in the Devanagari script rather than the Latin alphabet, so working with Nepali text takes more care and different tools than working with English. The struggle starts with the following two things:
- Printing the text in a console (Linux terminal or Windows command line); the output is often unreadable.
- Reading and writing the text in a database.
Printing Nepali text in Console/Terminal
During development, or when running machine learning libraries, we frequently need to print text to the terminal. For Nepali text, getting it printed correctly is challenging, but there are solutions.
For languages like C or C++, there are several ways to print Unicode characters as UTF-16 or UTF-8.
In C++ on Windows, for example:
_setmode(_fileno(stdout), _O_U16TEXT);
This setting works, but not for every script, such as Nepali, Arabic, or Hebrew. It did not help me because it did not display all Nepali text correctly. My work also required printing Nepali text from the Python interpreter to the terminal. For that, in the Linux terminal, I set up the languages and locales and managed to display Nepali text, but still not every character came out right. Some words were displayed incorrectly, and it looked like I was still missing the right fonts (I tried more than 10 that should have supported Devanagari).
So instead of spending more time trying every available font and setting, I decided to go with an awesome terminal called Konsole. You can download it from https://konsole.kde.org/. I was able to install it with a single command and get it running on my Ubuntu 18 machine straight away.
sudo apt-get install konsole
It was version 4:17.12.3 at the time of my experimentation. After installation, just launch it from the GUI and you are presented with a terminal similar to the built-in Linux terminal (on Linux variants that have a GUI). All Nepali text was displayed correctly, and it has all the features and functionality of a normal terminal, so you can run any command in it and work with any Unicode non-Latin text.
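When the terminal shows broken text, it can help to check whether the problem is the encoding or the rendering (fonts and shaping), as it turned out to be in my case. The following is a minimal Python sketch, assuming Python 3.7 or later; the helper names `describe` and `stdout_is_utf8` are made up for illustration. It lists the code points of a string so you can confirm the data itself is intact even when the terminal draws it badly.

```python
import sys
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def stdout_is_utf8():
    """Report whether standard output is currently encoded as UTF-8."""
    encoding = (sys.stdout.encoding or "").lower()
    return encoding.replace("-", "").replace("_", "") == "utf8"


if __name__ == "__main__":
    sample = "नेपाली"
    # Switch stdout to UTF-8 if the locale picked something else.
    if not stdout_is_utf8() and hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")
    print(sample)
    for _, code_point, name in describe(sample):
        print(code_point, name)
```

If the code points and names come out as Devanagari letters and vowel signs but the printed word still looks wrong, the data is fine and the terminal or font is the likely culprit.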
Reading-Writing Nepali text in Database
Another challenge in NLP work on Nepali text is storing it in, and fetching it from, a database properly. I mostly worked with MySQL/MariaDB, so the examples here are for that database, but the concept applies equally to other data storage technologies.
Storage
The first and most important step is to make sure the MySQL character set and collation are set to utf8. Personally I prefer utf8_general_ci. You can do this after logging into the mysql terminal or from the phpMyAdmin SQL query input. There are other ways, such as the my.cnf config file, but the simplest is to run a query directly against the particular database or table.
ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;
This changes the default character set and collation of your whole database, so tables created afterwards inherit them. Do not forget to replace dbname with your actual database name.
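To do this only for a particular table:
ALTER TABLE tablename CHARACTER SET utf8 COLLATE utf8_general_ci;
Note that this sets the table default used by new columns. To convert existing text columns as well, use CONVERT TO CHARACTER SET in place of CHARACTER SET.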
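MySQL's utf8 character set stores at most three bytes per character, which is enough for Devanagari. Here is a small Python sketch for checking this against your own text; the function names are illustrative and not part of any library.

```python
def utf8_width(ch):
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def max_utf8_width(text):
    """Widest character in the text, measured in UTF-8 bytes."""
    return max(utf8_width(ch) for ch in text)


if __name__ == "__main__":
    sample = "नेपाली"
    print(max_utf8_width(sample))      # bytes needed by the widest character
    print(len(sample.encode("utf-8")))  # total bytes for the whole word
```

If this reports 4 for some of your data (emoji, for instance), utf8 will not hold it and utf8mb4 would be needed instead.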
Reading/Writing from popular programming language
Python
Python handles the encoding nicely, without any extra parameters, when reading from a MySQL database. Install the MySQL connector through pip and import it.
pip install mysql-connector-python
You can pin a specific version if you like:
pip install mysql-connector-python==8.0.11
import mysql.connector

# Connect to the local MySQL server
mydb = mysql.connector.connect(
    host="localhost",
    user="manoj",
    passwd="pass",
    database="db"
)

# Run the query and fetch all rows
mycursor = mydb.cursor()
mycursor.execute("SELECT fields FROM table")
result = mycursor.fetchall()
The query is run directly, as shown above (best practices and structure are not the focus here).
For writing, a plain execute call again works, as long as you have kept the character encoding and collation set as described.
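A write might look like the sketch below. It is an assumption-laden example rather than the exact code I ran: the table `documents` and column `body` are hypothetical, and the `charset` and `use_unicode` connection arguments are set explicitly here only to make the encoding visible. The connector is imported inside `open_connection` so the insert helper can be tried with any connection-like object.

```python
def open_connection(host, user, password, database):
    """Open a MySQL connection that uses utf8 (needs mysql-connector-python)."""
    import mysql.connector  # imported lazily; only needed for a real connection

    return mysql.connector.connect(
        host=host,
        user=user,
        password=password,
        database=database,
        charset="utf8",
        use_unicode=True,
    )


def insert_text(conn, text):
    """Insert one string with a parameterized query, then commit."""
    cursor = conn.cursor()
    # `documents` and `body` are hypothetical names; replace with your own.
    cursor.execute("INSERT INTO documents (body) VALUES (%s)", (text,))
    conn.commit()
    cursor.close()


if __name__ == "__main__":
    # Credentials mirror the reading example above; adjust to your setup.
    connection = open_connection("localhost", "manoj", "pass", "db")
    insert_text(connection, "नेपाली")
    connection.close()
```

Passing the text as a parameter, instead of building the SQL string by hand, leaves quoting and encoding to the connector.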
Java
To connect MySQL with a Java application, I used com.mysql.jdbc.Driver. After downloading the jar file from the official MySQL site, I added it to the CLASSPATH so that the Java Virtual Machine and the Java compiler know the location of the package.
export CLASSPATH=:/home/manoj/commons-lang3-3.9.jar:.:/usr/share/java/mysql-connector-java.jar:.:/home/manoj/commons-text-1.6.jar
Note: I also had to use the commons-lang and commons-text packages for text processing.
As far as I recall, I had to explicitly define characterEncoding when connecting to the database. Since I was working locally and no SSL was set up for the local connection, I set useSSL to false.
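import java.sql.Connection;
import java.sql.DriverManager;

String myDriver = "com.mysql.jdbc.Driver";
String myUrl = "jdbc:mysql://localhost:3306/lda?characterEncoding=UTF-8&useSSL=false";
try {
    // Load the JDBC driver class
    Class.forName(myDriver);
} catch (Exception ex) {
    System.out.println("could not load the mysql driver. exiting.");
    System.exit(1);
}
// Open the connection with the encoding set in the URL
Connection conn = DriverManager.getConnection(myUrl, "root", "manoj");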
Make sure your MySQL server is running on port 3306 and that the database name, username, and password match yours.
I was able to get the Nepali text with a simple fetch query:
String query = "SELECT * FROM table";
// create the java statement
Statement st = conn.createStatement();
ResultSet rs = st.executeQuery(query);
Inserting data also worked once the connection was made with characterEncoding defined.
query = "INSERT INTO ……..";
st.addBatch(query);
st.executeBatch();
PHP
In PHP, the character set of the MySQL connection had to be set to utf8 for reading and writing.
// Create connection
$conn = new mysqli('localhost', 'manoj', 'db', 'db');
// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
}
// Set character set
mysqli_query($conn, "set names 'utf8'");
$sql = "SELECT * from table";
$result = mysqli_query($conn, $sql);
The same connection object $conn was able to insert data without any further processing specific to character encoding.
Note: All of the above examples assume you have already set the character set and collation of your MySQL database to utf8.
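When that assumption does not hold, text usually comes back garbled. The Python sketch below shows what such a mismatch can look like, assuming UTF-8 bytes get misread as Latin-1 somewhere between the database and the application. Both helper names are made up for illustration.

```python
def simulate_wrong_charset(text):
    """Decode the UTF-8 bytes of `text` as Latin-1, as a mismatched setup might."""
    return text.encode("utf-8").decode("latin-1")


def contains_devanagari(text):
    """True if any character falls in the Devanagari block (U+0900 to U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


if __name__ == "__main__":
    original = "नेपाली"
    garbled = simulate_wrong_charset(original)
    print(contains_devanagari(original))
    print(contains_devanagari(garbled))
    # The bytes are still intact, so reversing the wrong step recovers the text.
    print(garbled.encode("latin-1").decode("utf-8") == original)
```

A check like `contains_devanagari` on a few fetched rows is a cheap way to notice a charset mismatch early.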