Suggestion: hybrid conversion engine #11

Closed
GoogleCodeExporter opened this issue Apr 22, 2015 · 3 comments

@GoogleCodeExporter

I heard that Mozc uses 最小コスト法 (the minimum-cost method), which works well for long kana-kanji conversion.
I think Mozc is better than Anthy for long sentences.

However, many people say Mozc is weak at short kana-kanji conversion.
For example, Mozc doesn't offer "午後" for "gogo".

I saw "パルックボールプレミアQ" at the local store -
http://panasonic.co.jp/corp/news/official.data/data.dir/jn080609-2/jn080609-2.html
A 電球形蛍光灯 (bulb-type fluorescent lamp) is dim at startup, so Panasonic added a conventional 白熱電球 (incandescent bulb) element that lights it for the first few minutes.
http://www.lightstyle.jp/?cn=100026&shc=10000503
> パルックボールプレミアクイック is a clever product: it has an incandescent filament built into the fluorescent lamp, so the lamp starts up quickly.

And I've found Taku-san's tweet:
http://twitter.com/taku910/status/15560702282
> I get the feeling that N文節最長一致 (N-bunsetsu longest match) works well only when you type in small, bunsetsu-sized units.

That's it!
Is it possible to switch conversion engines?

if yomigana_length <= 10 characters
 mozc_engine = N文節最長一致
else
 mozc_engine = 最小コスト法
end

For example, if the yomigana is "ごご", Mozc would use N文節最長一致.
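
The switch proposed above can be sketched in Python as a small dispatcher. This is only an illustration of the idea: the two engine functions are stand-ins invented for this example, not Mozc APIs, and the 10-character threshold is simply the value from the pseudocode.

```python
from typing import Callable

# Threshold taken from the pseudocode in this issue (10 kana characters).
SHORT_READING_LIMIT = 10

Converter = Callable[[str], str]


def make_hybrid_converter(short_engine: Converter,
                          long_engine: Converter,
                          limit: int = SHORT_READING_LIMIT) -> Converter:
    """Return a converter that picks an engine from the reading's length."""
    def convert(yomigana: str) -> str:
        engine = short_engine if len(yomigana) <= limit else long_engine
        return engine(yomigana)
    return convert


# Hypothetical stand-in engines: they only report which path was taken.
def longest_match_stub(yomigana: str) -> str:
    return f"longest-match:{yomigana}"


def min_cost_stub(yomigana: str) -> str:
    return f"min-cost:{yomigana}"


hybrid = make_hybrid_converter(longest_match_stub, min_cost_stub)
print(hybrid("ごご"))                                  # short reading
print(hybrid("きょうはいいてんきですねといいました"))  # long reading
```

Passing the engines in as plain functions keeps the length check separate from either conversion algorithm, so the threshold can be tuned without touching them.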

What do you think about it?

Original issue reported on code.google.com by heathros...@gmail.com on 14 Jun 2010 at 7:24

@GoogleCodeExporter

Thank you for the proposal.

Combining in N文節最長一致 might be a reasonable solution, but I think there is a lot we should do before trying it.
I believe we can fix the issue with the current algorithm alone.

Let me summarize why we have no plans to use N文節最長一致 at the moment.

1. There is no theoretical justification for N文節最長一致. From a statistical natural language processing point of view, we cannot explain how or why it works. I have never met an NLP researcher who prefers N文節最長一致 over コスト最小法, which has a strong theoretical foundation.
2. The current algorithm fails on short inputs because we build the language model from complete sentences, which do not always match the typical input 'unit'. As a result, the connection probabilities to the beginning and end of a sentence are currently underestimated. I believe this can be solved by adjusting the training data.
3. We'd like to keep the conversion engine (code) as simple as possible. Ideally, conversion quality should be determined only by the language model and the dictionary. Otherwise, whenever a mis-conversion occurs we have to examine both the conversion algorithm and the language model, which makes debugging harder. In fact, the core conversion algorithm has not changed since we started the project, yet we have been able to improve quality by updating the language model and dictionary.
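
Point 2 can be illustrated with a toy minimum-cost search over a word lattice. This is a sketch under loud assumptions, not Mozc's implementation: the dictionary entries, part-of-speech classes, and every cost below are invented, and "五後" is a contrived competitor rather than a real word. Costs stand for negative log probabilities, so an underestimated connection probability to the end of sentence shows up as an overly high cost.

```python
from typing import Dict, List, Tuple

# Toy dictionary: reading -> [(surface, class, word cost)]. All invented.
DICTIONARY: Dict[str, List[Tuple[str, str, float]]] = {
    "ごご": [("午後", "NOUN", 2.0)],
    "ご": [("五", "NUM", 1.5), ("後", "SUFFIX", 1.5)],
}

UNSEEN_COST = 10.0  # assumed penalty for class pairs missing from the table

ConnTable = Dict[Tuple[str, str], float]


def convert(reading: str, conn: ConnTable) -> Tuple[float, str]:
    """Return (total cost, surface) of the cheapest path through the lattice."""
    n = len(reading)
    # best[i][cls] = cheapest (cost, surface) covering reading[:i], ending in cls
    best: List[Dict[str, Tuple[float, str]]] = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0.0, "")
    for start in range(n):
        for left_cls, (left_cost, left_surface) in best[start].items():
            for end in range(start + 1, n + 1):
                for surface, cls, word_cost in DICTIONARY.get(reading[start:end], []):
                    cost = (left_cost
                            + conn.get((left_cls, cls), UNSEEN_COST)
                            + word_cost)
                    if cls not in best[end] or cost < best[end][cls][0]:
                        best[end][cls] = (cost, left_surface + surface)
    if not best[n]:
        raise ValueError("no path covers the whole reading")
    # Close every complete path with its connection to end-of-sentence.
    return min((cost + conn.get((cls, "EOS"), UNSEEN_COST), surface)
               for cls, (cost, surface) in best[n].items())


# Assumed "trained on full sentences" table: a bare noun rarely ends a sentence.
sentence_conn: ConnTable = {
    ("BOS", "NOUN"): 1.0, ("NOUN", "EOS"): 6.0,
    ("BOS", "NUM"): 1.0, ("NUM", "SUFFIX"): 0.5, ("SUFFIX", "EOS"): 1.0,
}
# Assumed "trained on typical input units" table: only NOUN -> EOS is cheaper.
unit_conn: ConnTable = dict(sentence_conn)
unit_conn[("NOUN", "EOS")] = 1.0

print(convert("ごご", sentence_conn))  # (5.5, '五後')
print(convert("ごご", unit_conn))      # (4.0, '午後')
```

The search code is identical in both calls; only the connection table changes, which is the sense in which quality can be improved through the language model alone.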

In any case, we would appreciate it if you could send us sentences or phrases that Mozc failed to convert.

Original comment by t...@google.com on 16 Jun 2010 at 4:57

@GoogleCodeExporter

> we can fix the issue only with the current algorithm.

Great news!

> From a statistical natural language processing point of view,
> we cannot explain how and why N文節最長一致 works.

Hmm, maybe N文節最長一致 is like 占い (fortune-telling) or ムー (the occult magazine). :-)

> The reason why the current algorithm fails to convert short inputs is that 
> we are using complete sentences for making a language model, which is not 
> always equivalent to the typical input 'unit'.

OK, so you will improve Mozc for those shorter input units.

Thank you for the reply. I look forward to a better Mozc in the next release.

Original comment by heathros...@gmail.com on 16 Jun 2010 at 2:40

@GoogleCodeExporter

Original comment by t...@google.com on 16 Jun 2010 at 11:36

  • Changed state: WontFix
