IPAdic license #20

Closed
GoogleCodeExporter opened this issue Apr 22, 2015 · 10 comments

Comments

@GoogleCodeExporter

Is the Mozc dictionary based on the IPAdic from
http://sourceforge.jp/projects/naist-jdic/,
or on the one from http://sourceforge.jp/projects/ipadic/?
I cannot tell which, because no IPAdic license is included.
Would you add it?

Original issue reported on code.google.com by iwama...@gmail.com on 5 Aug 2010 at 6:40

@GoogleCodeExporter

The credit for the Mozc dictionary is included in the proprietary version,
Google Japanese Input.
We basically use both ipadic and naist-jdic, but the current dictionary
is mainly based on ipadic.
We found it difficult to use naist-jdic because of the quality issues reported
below.
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000418.html

I think this issue can be resolved once issue 6 is fixed. If that is OK, I
would like to mark this bug as a duplicate.

Original comment by t...@google.com on 11 Aug 2010 at 2:02

@GoogleCodeExporter

Thanks for your comments.

> The credit for the Mozc dictionary is included in the proprietary version,
> Google Japanese Input.
> We are basically using both ipadic and naist-jdic, but the current dictionary
> is mainly based on ipadic.

# you wrote "mainly based on ipadic", not "only ipadic".

I do not know which versions of ipadic and naist-jdic you use.
Won't the licenses conflict if you used naist-jdic after its license was
changed?
And what license results when these two data sets are mixed?

It is currently not stated which ipadic or naist-jdic data, under which
license, the Mozc dictionary was generated from. I would be glad if you
documented this in the README.

> We found it difficult to use naist-jdic due to the quality issues reported
> below.
> http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000418.html
>
> I think this issue can be resolved if we fixed the issue 6. If it is OK, i
> want to mark this bug as duplicated.

Original comment by iwama...@gmail.com on 11 Aug 2010 at 8:23

@GoogleCodeExporter

I think you may have misunderstood the difference between ipadic and naist-jdic.
Although naist-jdic stems from ipadic, the two dictionaries are technically
different in terms of license. credits_ja.html, included in Google Japanese
Input, has "both" license terms. As far as I know, the licenses of ipadic and
naist-jdic have not changed since their initial releases. I'm wondering why the
version information and the exact license are so important in this situation.

Anyway, here are the versions of naist-jdic and ipadic that Mozc uses:
- mecab-naist-jdic-0.4.3-20080917
- mecab-ipadic-2.7.0-20070801 (ipadic-2.7.0)

We will add an extra description to README.txt.

Thanks.

Original comment by t...@google.com on 12 Aug 2010 at 5:05

@GoogleCodeExporter

Hi Taku-san,
mecab-ipadic is marked as "non-free" and
mecab-naist-jdic is marked as "free".
http://packages.debian.org/search?lang=en&keywords=mecab

I think Iwamatsu-san is a maintainer of the Debian Mozc packages.
http://packages.debian.org/en/squeeze/mozc-server
Debian users love Mozc, but they can't include non-free packages in the Debian 
official ISOs.

I also found Tagoh-san's tweets.
He maintains the Red Hat/Fedora Japanese packages.
http://twit411.com/tagoh
> I wonder how the license of the mozc dictionary is handled. Same as ipadic? #mozc
> More on the mozc dictionary license: the Debian package is in main, I see.
> ipadic seems to be non-free, but mozc's debian/copyright does not mention the
> dictionary at all #mozc
> I confirmed that data/installer/credits_{en,ja}.html states the licenses of
> both ipadic and naist-jdic. So what does that mean?

Original comment by heathros...@gmail.com on 14 Aug 2010 at 9:42

@GoogleCodeExporter

I know that ipadic is marked as a "non-free" package. One of the goals of
naist-jdic is to clear up the license issue so that Debian users can include it
in the official packages.

We once attempted to use naist-jdic, but, unfortunately, we found that
naist-jdic has several critical quality issues. In short, many common words
cannot be analyzed correctly with mecab-naist-jdic. Here are the details:
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000418.html

Also, the quality of naist-jdic is not stable right now, which makes it hard
for us to maintain and debug the conversion results and quality. Once all the
quality issues of naist-jdic are resolved, we will switch from ipadic to
naist-jdic, but we don't have any concrete plans yet.

Original comment by t...@google.com on 16 Aug 2010 at 3:18

@GoogleCodeExporter

Reply to comment 3:

Thanks for your comments.
OK, so the dictionary data is in a state where two different licenses are mixed.
# Maybe (IANAL) it amounts to a BSD license plus the clause "As long as no
# applicable laws are violated, this program itself, or a modified version of
# it, may be freely distributed to third parties."

BTW, could you tell me which file is derived from ipadic? dictionary0.txt?
dictionary1.txt?

Original comment by iwama...@gmail.com on 18 Aug 2010 at 2:38

@GoogleCodeExporter

Both dictionary0.txt and dictionary1.txt.

We split the entire text data into two files simply because we found that not
all source code management systems can handle our large text dictionary.
We are going to split it into more files.

Original comment by t...@google.com on 24 Aug 2010 at 7:34

@GoogleCodeExporter

Hi,

I was wondering how to make mozc 100% free while keeping it BSD-license
friendly with minimum effort.

My conclusions are:
 * The conversion quality issue of naist-jdic is not directly related to the
   mecab data quality which NAIST is working on.
 * The small scope of the data mined to create naist-jdic causes its content to
   be skewed and incomplete.  This is __the root cause__.
 * Using ipadic is essentially equivalent to using the ICOT dictionary +
   naist-jdic.
   * The ICOT dictionary provides more nouns and kanji compounds (jukugo).
   * naist-jdic provides grammatical context data.
 * We should get as good or even better results using alternative data.
   * The edict package is under a CC-SA license (FREE like BSD) and contains a
     lot of good data, although it is not exactly mecab-ready.
     http://packages.debian.org/source/sid/edict
   * edict can provide pronunciation and coarse grammar assignment.
   * edict has a huge separate proper-name data set, but place names (地名) and
     personal names (人名) are mixed together.
 * I do not know how to sneak the edict data in yet, but a dummy low-frequency
   value looks better than having no data.

Let me explain why I thought this way.

As I understand it,
 * the quality of mozc conversion using only naist-jdic is not as good as that
   using ipadic.
 * naist-jdic was created by manually removing the dictionary data that came
   from ICOT.
 * the dictionary data from ICOT is non-free and is present in ipadic.
 * naist-jdic has more up-to-date contents than the older ipadic.

Although the *quality* of naist-jdic is questioned in
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/thread.html#418
, following the discussion thread made me think a bit about this.  The concern
about 「る次」 was raised, but it was explained and was actually an improvement.
Then
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000424.html
was posted.  This was interesting and seemed directly linked to the
*quality* concern about mozc conversion.

The lack of basic words like 季節 and 奇怪 in naist-jdic, as pointed out in the
discussion, will certainly degrade conversion.

To assess the actual situation, I investigated the words missing from
naist-jdic.

mecab-ipadic-2.7.0+20070801
   392,126 data entries with grammatical/statistical data
   173,936 unique base-form words

mecab-naist-jdic-0.6.3-20100801
   485,893 data entries with grammatical/statistical data
   180,943 unique base-form words

So naist-jdic is the larger data set, with more words.
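
Counts like the ones above could be reproduced with something like the following. This is only a minimal sketch, not the script actually used for this comment. It assumes the ipadic-style MeCab CSV layout in which the base form is the 11th field (index 10); naist-jdic rows may carry extra trailing fields, and the real files would need to be opened with the right encoding (EUC-JP for the ipadic sources, as far as I know). The function name `count_entries` and the sample rows are made up for illustration.

```python
import csv
from typing import Iterable, Tuple

# Assumption: ipadic-style layout
# surface,left-id,right-id,cost,pos,pos1,pos2,pos3,conj-type,conj-form,base-form,reading,pronunciation
BASE_FORM_COLUMN = 10


def count_entries(lines: Iterable[str],
                  base_col: int = BASE_FORM_COLUMN) -> Tuple[int, int]:
    """Return (number of data entries, number of unique base forms)."""
    entries = 0
    base_forms = set()
    for row in csv.reader(lines):
        if len(row) <= base_col:
            # Skip rows that are too short to carry a base-form field.
            continue
        entries += 1
        base_forms.add(row[base_col])
    return entries, len(base_forms)


# Hypothetical sample rows (IDs and costs are dummy values).
sample = [
    "行く,0,0,100,動詞,自立,*,*,五段・カ行促音便,基本形,行く,イク,イク",
    "行っ,0,0,100,動詞,自立,*,*,五段・カ行促音便,連用タ接続,行く,イッ,イッ",
    "季節,0,0,100,名詞,一般,*,*,*,*,季節,キセツ,キセツ",
]
print(count_entries(sample))
```

Inflected surface forms such as 行く and 行っ share one base form, which is why the number of unique base-form words is much smaller than the number of entries.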

naist-jdic adds more words than it is missing.  Even if some words are missing
from naist-jdic, they can be found in edict, which is a much larger
dictionary.
 * compdic    12,683 computer-related words
 * edict     192,345 normal dictionary entries (kanji + reading + grammar)
 * enamdict  730,648 proper-name dictionary entries (kanji + reading + grammar)
 * kanjidic    6,356 single kanji with their on and kun readings, etc. (not used this time)

Here are some examples of words in this category (kanji compounds only):

 不可逆 乱高下 会舘 似非 傍迷惑 僻心 内冑 利高 割線 割線法 可読性 否定積
 奇怪 季節 家教 巾偏 帳面面 当事者 憶病 敬啓 数数 時期 棒縞 正反対 無神
 社会民主党 社共 私供 空相場 細石 継端 脱稿 複合 軟論 ...

Since these can be found in the edict data, they are non-issues once we figure
out how to sneak the edict data in.

So here are the truly missing words that I found.  Basically, because the mecab
data comes from newspaper articles and edict was created by someone teaching
the Japanese language, other fields are sometimes missing, but not many.  I may
have missed a few.  I went through all the kanji words with a Python script and
sort, uniq, diff, etc., so I can say this is an almost thorough list of
missing words.
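
The comparison described above can be sketched roughly as follows. This is an illustrative reconstruction under assumptions, not the original script: it treats each dictionary as a plain collection of base-form words, keeps only words written entirely in kanji (here approximated by the CJK Unified Ideographs block U+4E00 to U+9FFF, which excludes the repetition mark 々), and splits the words that ipadic has but naist-jdic lacks into those edict covers and those missing everywhere. The function name `split_missing` and the tiny word lists are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: "kanji word" means every character is in the basic CJK block.
KANJI_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def split_missing(ipadic: Iterable[str],
                  naist: Iterable[str],
                  edict: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (covered_by_edict, completely_missing) for kanji-only words
    that are in ipadic but not in naist-jdic."""
    missing = {w for w in set(ipadic) - set(naist) if KANJI_ONLY.fullmatch(w)}
    covered = missing & set(edict)
    return sorted(covered), sorted(missing - covered)


# Hypothetical toy data, using words mentioned in this comment.
ipadic_words = ["季節", "奇怪", "律速", "複合", "行く"]
naist_words = ["複合", "行く"]
edict_words = ["季節", "奇怪"]

covered, completely_missing = split_missing(ipadic_words, naist_words, edict_words)
print("found in edict:", covered)
print("completely missing:", completely_missing)
```

Set differences play the role of the `sort | uniq | diff` pipeline here; on the real data the inputs would be the base-form columns extracted from each dictionary.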

== COMPLETELY MISSING ==
Special-interest areas such as technical terms:
    律速 無向 額装 発炎筒

Some archaic letter-format wordings:
    敬呈 敬啓

Historical names:
    交趾 士爵

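== PARTIALLY MISSING ==

Corporate names:
    フイガロ技研   (「技研」 is also in jdic)
    横河トレーディング (「横河」 is also in jdic)
    (too many to list)

Compound words: the roots are in the dictionary. id.def (prefixes) has 「全」 and 「非」, but 「半」「不」「反」「正」「副」「零」 are not found.
    不可逆 非一致 非可分 零交差

Compound words: the roots are in the dictionary. suffix.txt lacks technical suffixes such as 「型」「形」「波」「子(し)」.
    零交差波 離散形 電信形

Misspelled entries were dropped -> the correct spellings are in the dictionary!
    不倶載天    (correct: 不倶戴天)
    散慢  (correct: 散漫)

These last ones prove that going to naist-jdic is a good thing.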

osamu@debian.org

Original comment by osamu.a...@gmail.com on 30 Jul 2011 at 6:14

@GoogleCodeExporter

There is no problem with MOZC using IPADIC, as explained below.  So the problem is solved.

After a careful review of the IPADIC license, I realized that it is a free license.
The complaint on "Legal" turned out to be a baseless claim that should have been
debated and rejected a long time ago.  I recently took action on this.  I got
the Debian FTP master to agree with me and accept IPADIC as DFSG compliant.  Then
MOZC was accepted as DFSG FREE!  Bravo!  So there is no problem.

Original comment by osamu.a...@gmail.com on 16 Nov 2011 at 2:52

@GoogleCodeExporter

Bravo!  I'm very glad to hear that.

Thank you very much for taking the effort.

Original comment by koma...@google.com on 16 Nov 2011 at 4:24

  • Changed state: Done

shitamo pushed a commit to shitamo/mozc that referenced this issue Jun 4, 2022