Extracting only Thai text using Python

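import re
import sys

# The Thai Unicode block is U+0E00-U+0E7F; U+0E80-U+0EFF is Lao, so it is excluded.
thai_run = re.compile('[\u0E00-\u0E7F]+')

# Undecodable bytes are skipped rather than raising an error.
with open(sys.argv[1], encoding='utf-8', errors='ignore') as f:
    for line in f:
        # Print each run of consecutive Thai characters on its own line.
        for unit in thai_run.findall(line):
            print(unit)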

(I used Python 3.2.x)
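
The same idea can be wrapped in a function that works on a string instead of a file, which makes it easier to try out. This is only a minimal sketch: the name extract_thai_runs is hypothetical, and it assumes that "Thai text" means any character in the Thai Unicode block (U+0E00 to U+0E7F), which also covers Thai digits and the baht sign.

```python
import re

# Thai block: U+0E00-U+0E7F. The Lao block follows at U+0E80-U+0EFF and is not matched.
THAI_RUN = re.compile('[\u0E00-\u0E7F]+')


def extract_thai_runs(text):
    """Return every maximal run of Thai-block characters in text, in order."""
    return THAI_RUN.findall(text)


if __name__ == '__main__':
    sample = 'Hello สวัสดี world ภาษาไทย 123'
    for run in extract_thai_runs(sample):
        print(run)
```

Each run is returned separately, so spaces, Latin letters and ASCII digits between Thai words act as separators and are dropped.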
