Lexical score in Moses

In Moses, train-factored-phrase-model.perl is the main program for computing the various scores of a phrase.

The lexical score starts with counting words. This is done right in train-factored-phrase-model.perl, in a subroutine named get_lexical.

However, the score of each phrase is computed in score.cpp instead. It reads the lexical tables that get_lexical created, from files named roughly lex….f2n and lex…..n2f. Once they are read, the scores are computed near the end of processPhrasePairs in score.cpp.
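
To make the idea concrete, here is a minimal sketch of lexical weighting as it is usually defined for phrase-based SMT: for each target word, average the word translation probabilities over the source words it is aligned to (or use NULL if it is unaligned), then multiply the averages. This is an illustration only, not a transcription of score.cpp; the function name lexical_weight, the dict-based table w, and the toy probabilities are all assumptions.

```python
def lexical_weight(target, source, alignment, w):
    """Lexical weight of `target` given `source` under a word alignment.

    target, source: lists of words.
    alignment: set of (i, j) pairs linking target[i] to source[j].
    w: dict mapping (target_word, source_word) to a word translation
       probability; "NULL" stands for the empty source word.
    """
    score = 1.0
    for i, t_word in enumerate(target):
        # Source words linked to this target word.
        linked = [source[j] for (ti, j) in sorted(alignment) if ti == i]
        if not linked:
            linked = ["NULL"]  # unaligned words are scored against NULL
        total = sum(w.get((t_word, s_word), 0.0) for s_word in linked)
        score *= total / len(linked)
    return score


# Hypothetical toy probabilities, not taken from a real lex table.
w = {
    ("ฉัน", "i"): 0.8,
    ("กิน", "eat"): 0.5,
    ("ข้าว", "rice"): 0.4,
    ("ข้าว", "eat"): 0.1,
}
alignment = {(0, 0), (1, 1), (2, 1), (2, 2)}
score = lexical_weight(["ฉัน", "กิน", "ข้าว"], ["i", "eat", "rice"], alignment, w)
print(score)
```

In this toy case the third target word is linked to two source words, so its contribution is the average of the two probabilities rather than their sum.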

Moses hypothesis (search graph)

We can get the Moses search graph with the option -output-search-graph. Its format is explained at http://www.statmt.org/moses/?n=Moses.AdvancedFeatures. However, even after reading it, I still did not understand several things, especially “covered” and “stack”. (“recombined” is out of scope for this post.) Both “covered” and “stack” relate to the covered foreign (source) language words: “stack” is the number of covered words, and “covered” gives the start and end positions of the covered words. These are trivial. However, probably because my English is bad, I did not understand from the explanation that “covered” refers only to the words translated by the current (latest) hypothesis, whereas “stack” also counts the words translated by previous (ancestor) hypotheses.

Example:
0 hyp=11 stack=1 back=0 score=-5.63929 transition=-5.63929 forward=160 fscore=-15.6831 covered=0-0 out=ฉัน ไม่ รู้

0 hyp=80 stack=1 back=0 score=-5.87765 transition=-5.87765 forward=880 fscore=-15.1593 covered=3-3 out=กิน อาหาร

0 hyp=859 stack=2 back=80 score=-7.85572 transition=-1.97807 forward=11192 fscore=-16.5095 covered=4-4 out=ที่


For example, hypothesis 859 has stack=2 because it also includes the word “eat” (3-3) from hypothesis 80. Previous (ancestor) hypotheses can be traced by following “back” (the back pointer). “out” is the translated string (in the target language). It corresponds directly to “covered”, since “out” contains only the string translated by the current hypothesis, not the strings from previous hypotheses.

In brief, “out” and “covered” describe only the translation made by the current hypothesis, while “stack” counts the current hypothesis plus everything inherited from previous hypotheses (ancestors). Previous hypotheses can be traced via “back”.
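
The relationship between these fields can be checked with a short script. This is a rough sketch, assuming each line has the key=value layout shown in the example, that “out” is always the last field, and that hypothesis 0 is the empty initial hypothesis. The function names parse_search_graph_line and trace are my own, for illustration only.

```python
def parse_search_graph_line(line):
    """Parse one search graph line into a dict of its fields."""
    # "out" may contain spaces, so split it off before tokenizing the rest.
    head, _, out = line.strip().partition(" out=")
    tokens = head.split()
    hyp = {"sentence": int(tokens[0]), "out": out}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        hyp[key] = value
    for key in ("hyp", "stack", "back"):
        if key in hyp:
            hyp[key] = int(hyp[key])
    if "covered" in hyp:
        start, _, end = hyp["covered"].partition("-")
        hyp["covered"] = (int(start), int(end))
    return hyp


def trace(hyps, hyp_id):
    """Follow back pointers from hyp_id; return hypotheses, oldest first."""
    path = []
    # Assumption: id 0 is the empty initial hypothesis, so stop there.
    while hyp_id != 0 and hyp_id in hyps:
        path.append(hyps[hyp_id])
        hyp_id = hyps[hyp_id].get("back", 0)
    path.reverse()
    return path


lines = [
    "0 hyp=11 stack=1 back=0 score=-5.63929 transition=-5.63929 forward=160 fscore=-15.6831 covered=0-0 out=ฉัน ไม่ รู้",
    "0 hyp=80 stack=1 back=0 score=-5.87765 transition=-5.87765 forward=880 fscore=-15.1593 covered=3-3 out=กิน อาหาร",
    "0 hyp=859 stack=2 back=80 score=-7.85572 transition=-1.97807 forward=11192 fscore=-16.5095 covered=4-4 out=ที่",
]
hyps = {}
for line in lines:
    parsed = parse_search_graph_line(line)
    hyps[parsed["hyp"]] = parsed

path = trace(hyps, 859)
# Words covered along the whole path should match "stack" of the last one.
covered_words = sum(end - start + 1 for (start, end) in (h["covered"] for h in path))
partial_output = " ".join(h["out"] for h in path)
print(covered_words, partial_output)
```

Summing the “covered” spans along the back-pointer chain gives the “stack” value of the last hypothesis, and joining the “out” strings gives the partial translation so far.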

Reading phrase table (for Moses) using Python

I’m going to analyze a phrase table generated by Moses. So I studied the phrase table format at http://www.statmt.org/moses/?n=FactoredTraining.ScorePhrases and wrote a Python script that reads a phrase table into Python dicts. The code is as follows.

import re

def _decode_tokens(field):
    # Split on whitespace and drop empty tokens.
    return field.split()

def _decode_link(link):
    # A link looks like "(0,1)" or "()"; return its indices as ints.
    m = re.match(r"\((.*)\)", link)
    if not m:
        raise RuntimeError("malformed link: %r" % link)
    return [int(tok) for tok in m.group(1).split(",") if tok != '']

def _decode_links(field):
    return [_decode_link(link) for link in field.split()]

def _decode_num(field):
    return [float(tok) for tok in field.split()]

def read_phrase_table(filename):
    NUM_FIELD = 5
    with open(filename) as f:
        for i, line in enumerate(f):
            # Fields are separated by "|||".
            fields = line.strip().split("|||")
            if len(fields) != NUM_FIELD:
                raise RuntimeError("line %d: expected %d fields" % (i + 1, NUM_FIELD))
            phrase = {}
            phrase['source'] = _decode_tokens(fields[0])
            phrase['target'] = _decode_tokens(fields[1])
            phrase['links'] = _decode_links(fields[2])
            phrase['rev_links'] = _decode_links(fields[3])
            nums = _decode_num(fields[4])
            phrase['phrase_trans_prob'] = nums[0]
            phrase['lex_weight'] = nums[1]
            phrase['rev_phrase_trans_prob'] = nums[2]
            phrase['rev_lex_weight'] = nums[3]
            phrase['phrase_penalty'] = nums[4]
            yield phrase

def main():
    for phrase in read_phrase_table("phrase-table.0-0"):
        print(phrase)

if __name__ == '__main__':
    main()
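
As one possible first analysis step, the entries can be grouped by source phrase to keep the highest-scoring candidate for each. This is only a sketch: the helper best_by_source is hypothetical, and the entries below are made-up dicts in the same shape that read_phrase_table yields, not rows from a real phrase table.

```python
def best_by_source(phrases, key="phrase_trans_prob"):
    """Keep, for each source phrase, the entry with the highest `key` score."""
    best = {}
    for phrase in phrases:
        source = tuple(phrase["source"])  # lists are not hashable
        if source not in best or phrase[key] > best[source][key]:
            best[source] = phrase
    return best


# Hypothetical entries, shaped like the dicts from read_phrase_table.
entries = [
    {"source": ["eat"], "target": ["กิน"], "phrase_trans_prob": 0.6},
    {"source": ["eat"], "target": ["ทาน"], "phrase_trans_prob": 0.3},
    {"source": ["rice"], "target": ["ข้าว"], "phrase_trans_prob": 0.9},
]
best = best_by_source(entries)
for source, phrase in sorted(best.items()):
    print(" ".join(source), "->", " ".join(phrase["target"]))
```

Because read_phrase_table is a generator, its result can be passed to best_by_source directly without loading the whole table into a list first.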

Simple + dirty Python binding for Moses (SMT decoder)

I want to call Moses from Python like this:

from moses import Moses
m = Moses("enth/model/moses.ini")
print m.decode("i eat rice")
# result: ฉัน กิน ข้าว

So I created this extension.

moses.cpp:
#include <Python.h>
#include "structmember.h"
#include "Parameter.h"
#include "StaticData.h"
#include "Manager.h"
#include "Hypothesis.h"
#include <sstream>
#include <string>
#include <vector>

typedef struct {
    PyObject_HEAD
    Parameter *param;
    const StaticData *staticData;
    std::vector<float> *weights;
    std::vector<FactorType> *inputFactorOrder, *outputFactorOrder;
    FactorMask *inputFactorUsed;
    long translation_id;
} Moses;

static int
Moses_traverse(Moses *self, visitproc visit, void *arg)
{
    return 0;
}

static int
Moses_clear(Moses *self)
{
    // FIXME
    return 0;
}

static void
Moses_dealloc(Moses* self)
{
    Moses_clear(self);
    self->ob_type->tp_free((PyObject*)self);
}

static PyObject *
Moses_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    Moses *self;

    self = (Moses *)type->tp_alloc(type, 0);

    if (self != NULL) {
        self->param = new Parameter();
        if(self->param == 0) {
            Py_DECREF(self);
            return NULL;
        }
        self->staticData = &(StaticData::Instance());
        if(self->staticData == 0) {
            Py_DECREF(self);
            return NULL;
        }

        self->weights = new std::vector<float>;
        if(self->weights == 0) {
            Py_DECREF(self);
            return NULL;
        }

        self->inputFactorOrder = new std::vector<FactorType>();
        if(self->inputFactorOrder == NULL) {
            Py_DECREF(self);
            return NULL;
        }

        self->outputFactorOrder = new std::vector<FactorType>();
        if(self->outputFactorOrder == NULL) {
            Py_DECREF(self);
            return NULL;
        }

        self->inputFactorUsed = new FactorMask();
        if(self->inputFactorUsed == NULL) {
            Py_DECREF(self);
            return NULL;
        }

        self->translation_id = 0;
    }

    return (PyObject *)self;
}

static int
Moses_init(Moses *self, PyObject *args, PyObject *kwds)
{
    const char *ini_path;

    if(!PyArg_ParseTuple(args, "s", &ini_path))
        return -1;
    if(!(self->param->LoadParam(std::string(ini_path))))
        return -1;

    if (!StaticData::LoadDataStatic(self->param))
        return -1;

    if(self->weights) {
        delete self->weights;
        self->weights = new std::vector<float>(self->staticData->GetAllWeights());
    }

    if(self->weights->size() != self->staticData->GetScoreIndexManager().GetTotalNumberOfScores())
        return -1;

    if(self->inputFactorOrder && self->outputFactorOrder && self->inputFactorUsed) {
        delete self->inputFactorOrder;
        delete self->outputFactorOrder;
        delete self->inputFactorUsed;
        self->inputFactorOrder = new std::vector<FactorType>(self->staticData->GetInputFactorOrder());
        self->outputFactorOrder = new std::vector<FactorType>(self->staticData->GetOutputFactorOrder());
        self->inputFactorUsed = new FactorMask(*self->inputFactorOrder);
    }
    return 0;
}

static PyMemberDef Moses_members[] = {
    /*
    {"first", T_OBJECT_EX, offsetof(Moses, first), 0,
     "first name"},
    {"last", T_OBJECT_EX, offsetof(Moses, last), 0,
     "last name"},
    {"number", T_INT, offsetof(Moses, number), 0,
     "moses number"}, */
    {NULL} /* Sentinel */
};

static PyObject *
Moses_decode(Moses* self, PyObject *args)
{
    const char *source_sentence;
    if(!PyArg_ParseTuple(args, "s", &source_sentence))
        return NULL;
    std::stringstream s;
    s << source_sentence << std::endl;
    // Assumption: the input is read into a Moses Sentence object.
    Sentence source(Input);
    if (source.Read(s, *self->inputFactorOrder)) {
        if (long x = source.GetTranslationId()) {
            if (x >= self->translation_id) {
                self->translation_id = x + 1;
            }
        } else {
            source.SetTranslationId(self->translation_id++);
        }
        Manager manager(source, self->staticData->GetSearchAlgorithm());
        manager.ProcessSentence();
        const Hypothesis *hypo = manager.GetBestHypothesis();
        PyObject* result = PyList_New(0);
        // Walk back from the best hypothesis, collecting one phrase per step.
        while(hypo != NULL) {
            std::stringstream phrase_stream;
            phrase_stream << hypo->GetCurrTargetPhrase();
            // PyList_Append does not steal the reference, so release ours.
            PyObject *item = PyString_FromString(phrase_stream.str().c_str());
            PyList_Append(result, item);
            Py_DECREF(item);
            hypo = hypo->GetPrevHypo();
        }
        PyList_Reverse(result);
        return result;
    } else {
        PyErr_SetString(PyExc_RuntimeError, "Input cannot be read properly");
        return NULL;
    }
}

static PyMethodDef Moses_methods[] = {
    {"decode", (PyCFunction)Moses_decode, METH_VARARGS,
     "Return decoded target language"},
    {NULL} /* Sentinel */
};

static PyTypeObject MosesType = {
    PyObject_HEAD_INIT(NULL)
    0, /*ob_size*/
    "moses.Moses", /*tp_name*/
    sizeof(Moses), /*tp_basicsize*/
    0, /*tp_itemsize*/
    (destructor)Moses_dealloc, /*tp_dealloc*/
    0, /*tp_print*/
    0, /*tp_getattr*/
    0, /*tp_setattr*/
    0, /*tp_compare*/
    0, /*tp_repr*/
    0, /*tp_as_number*/
    0, /*tp_as_sequence*/
    0, /*tp_as_mapping*/
    0, /*tp_hash */
    0, /*tp_call*/
    0, /*tp_str*/
    0, /*tp_getattro*/
    0, /*tp_setattro*/
    0, /*tp_as_buffer*/
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HAVE_GC, /*tp_flags*/
    "Moses objects", /* tp_doc */
    (traverseproc)Moses_traverse, /* tp_traverse */
    (inquiry)Moses_clear, /* tp_clear */
    0, /* tp_richcompare */
    0, /* tp_weaklistoffset */
    0, /* tp_iter */
    0, /* tp_iternext */
    Moses_methods, /* tp_methods */
    Moses_members, /* tp_members */
    0, /* tp_getset */
    0, /* tp_base */
    0, /* tp_dict */
    0, /* tp_descr_get */
    0, /* tp_descr_set */
    0, /* tp_dictoffset */
    (initproc)Moses_init, /* tp_init */
    0, /* tp_alloc */
    Moses_new, /* tp_new */
};

static PyMethodDef module_methods[] = {
    {NULL} /* Sentinel */
};

#ifndef PyMODINIT_FUNC /* declarations for DLL import/export */
#define PyMODINIT_FUNC void
#endif
PyMODINIT_FUNC
initmoses(void)
{
    PyObject* m;

    if (PyType_Ready(&MosesType) < 0)
        return;

    m = Py_InitModule3("moses", module_methods,
                       "Example module that creates an extension type.");

    if (m == NULL)
        return;

    Py_INCREF(&MosesType);
    PyModule_AddObject(m, "Moses", (PyObject *)&MosesType);
}

setup.py:
from distutils.core import setup, Extension

module1 = Extension('moses',
                    define_macros = [('MAJOR_VERSION', '0'),
                                     ('MINOR_VERSION', '1')],
                    include_dirs = ['../moses/src'],
                    libraries = ['moses', 'z', 'oolm', 'dstruct', 'misc'],
                    library_dirs = ['../moses/src'],
                    sources = ['moses.cpp'])

setup (name = 'moses',
       version = '0.1',
       description = 'This is a demo package',
       author = 'Vee Satayamas',
       author_email = 'vsatayamas@gmail.com',
       url = 'http://www.python.org/doc/current/ext/building.html',
       long_description = '''
This is really just a demo package.
''',
       ext_modules = [module1])

Thai – English statistical machine translation

For statistical machine translation, there are now open source programs that can be downloaded and used, such as GIZA++, which prepares the model, and Moses, which does the translation (decoding). They are used as shown in the diagram below.

smt_flow.png

What seems to be missing is a “Thai sentence segmentation program” (although one is available if you translate from English to Thai), as well as a parallel corpus that is large enough.