Spell checking on linux [message #26322]
Sat, 24 April 2010 21:00
Hi!
Today I decided to use the spellchecker while writing a new page for uppweb. I knew there is one, but to my great surprise, it didn't work.
After consulting Koldo and digging through the sources, I found where the problem is: there is no dictionary in the Linux releases. Well, no problem, I said to myself, let's download the dictionary from svn or sf.net. Another surprise: there is no English dictionary on sf.net and there are no dictionaries at all in svn.
So I had to download the Windows installer and extract the en-us.scd file from there. I put it into ~/.upp/theide, where theide is supposed to look for the dictionaries. And here comes the last surprise: it didn't help either.
Finally, after reading the sources more carefully, I noticed it is actually looking for "EN-US.scd", which makes a great difference on case-sensitive filesystems.
So I have a few proposals:
- Add basic (at least en-us) dictionaries to the src packages. I'll take care of the Debian packages.
- Put the files for all available languages in one common place for download (sf.net as it is now is fine, just add the English ones please).
- Make the files work correctly on Linux, i.e. either convert the basename to uppercase OR change the code in sGetSpeller() to:
...
pp << spell_path << ';' << getenv("LIB") << ';' << getenv("PATH") << ';';
String path = GetFileOnPath(ToLower(LNGAsText(lang)) + ".udc", pp);
if(IsNull(path))
    path = GetFileOnPath(ToLower(LNGAsText(lang)) + ".scd", pp);
if(IsNull(path))
    return false;
FileIn in(path);
...
- The usual complaint: this should be documented somewhere. If nobody disagrees, I'll put an article into the theide docs explaining how to obtain and install additional dictionaries inside theide.
Best regards,
Honza
Re: Spell checking on linux [message #26326 is a reply to message #26325]
Sun, 25 April 2010 00:40
koldo wrote on Sun, 25 April 2010 00:12 | Hello Honza
Perhaps another option would be to include on the Download page links to the "Speller Dictionaries" folder and to SDL. |
I actually like this one even better... The files are there already and anyone can just pick the ones they need. An "Additional resources" section with links on the Download page should cover it just fine. I'll do it together with the documentation...
Honza
Re: Spell checking on linux [message #26344 is a reply to message #26339]
Mon, 26 April 2010 13:16
luzr wrote on Mon, 26 April 2010 10:41 | Also, note that there is a SpellerDictionaries folder on sf.net. Maybe somebody should work on getting more .scd files there.
All you need is a plain text list of all correct words for a language, then use uppbox/MakeSpellScd to create the .scd file. |
I'm aware of the sf.net dictionaries, but the selection is rather limited. I was already looking for open-source dictionaries that could be converted to .scd. The best solution I have found so far would be to convert the aspell dictionaries. They are GPL licensed, which is not optimal, but better than nothing. I'll try to investigate how to convert them.
A part of the documentation will definitely be a list of available dictionaries with direct download links to sf.net. Also, thinking about it now, it should be rather easy to add a menu item to theide, "Fetch spellcheck dictionary...", which would ask for a language and try to fetch the file from sf.net and install it. A rough sketch follows.
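Just to illustrate the idea, here is a minimal sketch of such a handler, assuming U++'s HttpRequest and the ConfigFile("scd") directory that sGetSpeller() already scans; the download URL and the FetchDictionary name are purely hypothetical:
#include <Core/Core.h>

using namespace Upp;

// Hypothetical fetcher: download <lang>.udc and drop it where the
// speller looks for dictionaries. The URL layout below is made up.
bool FetchDictionary(int lang)
{
    String name = ToLower(LNGAsText(lang)) + ".udc";
    HttpRequest req("http://downloads.sourceforge.net/upp/" + name);
    String data = req.Execute();
    if(!req.IsSuccess())
        return false;
    // sGetSpeller() puts ConfigFile("scd") first on its search path.
    RealizeDirectory(ConfigFile("scd"));
    return SaveFile(AppendFileName(ConfigFile("scd"), name), data);
}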
Honza
Re: Spell checking on linux [message #26377 is a reply to message #26344]
Wed, 28 April 2010 17:27
Hi all!
Little update: After a bit of investigation (i.e. reading the manual) I have found a really simple way of converting aspell dictionaries to wordlists in a suitable format:
#!/bin/bash
# Export every installed aspell master dictionary as a plain wordlist.
DICTS=$( aspell dump dicts )
for d in $DICTS
do
    echo "Exporting $d ..."
    # Dump the dictionary, expand affixes, put one word per line.
    aspell dump master $d | aspell expand | sed 's/ /\n/g;' > wordlist.$d
done
echo "Finished!"
The above script asks aspell for a list of installed dictionaries (only the master ones, but it can be extended to include personal dictionaries as well). Then it cycles through all of them and creates the wordlists in the current directory.
For a list of available languages, see the official download page. If I count correctly, there are around 90 languages which can easily be converted to .scd files.
Now the question is what we have to do to satisfy the GPL license. Is it enough to mention that the dictionaries are GPL-licensed in the About box of theide and on the sf.net download page? I never really understood those licenses :-/
Regards,
Honza
Re: Spell checking on linux [message #26395 is a reply to message #26322]
Thu, 29 April 2010 22:45
Another update: I converted 68 dictionaries that are available from the Ubuntu repositories. Let's call that a standard selection. The dictionaries contain data varying from poor to excellent, but I was too lazy to go through the documentation and select only the good ones. Also, a bad dictionary is better than none.
I won't upload them anywhere just yet, since I'm still unsure about the licensing. If someone wishes to test their language, just contact me. Here comes the list; the second column is the file size:
af-af.udc 428K
am-am.udc 137
ar-ar.udc 364
bg-bg.udc 231K
bn-bn.udc 467
br-br.udc 62K
ca-ca.udc 745K
cs-cs.udc 3,5M
cy-cy.udc 180K
da-da.udc 388K
de-at.udc 474K
de-de.udc 473K
de-ch.udc 472K
el-el.udc 1,5K
en-ca.udc 226K
en-gb.udc 226K
en-us.udc 226K
eo-eo.udc 158K
es-es.udc 459K
et-et.udc 556K
eu-eu.udc 408K
fa-fa.udc 1,2K
fi-fi.udc 349K
fo-fo.udc 241K
fr-fr.udc 375K
fr-ch.udc 375K
ga-ga.udc 402K
gl-gl.udc 67K
gu-gu.udc 398
he-he.udc 1,8K
hi-hi.udc 410
hr-hr.udc 368K
hu-hu.udc 4,8M
hy-hy.udc 559
is-is.udc 457K
it-it.udc 346K
ku-ku.udc 17K
lt-lt.udc 567K
lv-lv.udc 548K
ml-ml.udc 711
mr-mr.udc 406
nb-nb.udc 1,2M
nl-nl.udc 605K
nn-nn.udc 810K
no-no.udc 1,2M
nr-nr.udc 44K
ns-ns.udc 21K
or-or.udc 115
pa-pa.udc 126
pl-pl.udc 1,8M
pt-br.udc 1,4M
pt-pt.udc 106K
ro-ro.udc 727K
ru-ru.udc 8,6K
sk-sk.udc 793K
sl-sl.udc 582K
ss-ss.udc 64K
st-st.udc 21K
sv-sv.udc 139K
ta-ta.udc 211
te-te.udc 550
tl-ph.udc 43K
tn-tn.udc 31K
ts-ts.udc 67K
uk-uk.udc 6,1K
uz-uz.udc 462
xh-xh.udc 56K
zu-zu.udc 203K
Re: Spell checking on linux [message #26404 is a reply to message #26395]
Fri, 30 April 2010 08:10
Mindtraveller
el-el.udc 1,5K
fa-fa.udc 1,2K
he-he.udc 1,8K
ku-ku.udc 17K
nr-nr.udc 44K
ns-ns.udc 21K
ru-ru.udc 8,6K
uk-uk.udc 6,1K
Looks like some language packs are almost empty. Did you try the OpenOffice spellchecking files?
Re: Spell checking on linux [message #26406 is a reply to message #26322]
Fri, 30 April 2010 09:40
mr_ped
I'm not sure how to deal with the GPL in the dictionaries.
MinGW was included under the GPL, although I'm afraid that's not OK (unless MinGW has some exception), and I'm afraid the dictionaries would taint the whole package with GPL.
It would be easy to get around this by including just some wizard script (under BSD) which downloads and converts the dictionaries for the user who runs it (I think this is a trivial workaround for any programmer, so it would suit our target users well), but you hit the GPL issue again later when deploying an app. If you include spell-check plus dictionaries, you either have to use such a workaround even for the final user (which may not work as well as it does for U++ users), or be sure how to distribute the GPL dictionaries properly under their license. (It would very likely be OK to make them available as an additional download on your site, so every user gets a BSD package as the deployment and can download any GPL stuff separately later on their own. I'm not sure if it is possible to create a legal bundle of both; I would say "no", unless you are very clever and creative.)
Re: Spell checking on linux [message #26412 is a reply to message #26406]
Fri, 30 April 2010 10:55
mirek
mr_ped wrote on Fri, 30 April 2010 03:40 | I'm not sure how to deal with the GPL in the dictionaries. [...] |
I guess for starters, putting them on sf.net and not making them part of the release would do.
I guess nobody wants to download all of them anyway.
Mirek
Re: Spell checking on linux [message #26421 is a reply to message #26322]
Fri, 30 April 2010 13:53
Hi all!
Koldo: They are all aspell dictionaries, acquired via the Ubuntu repositories. All of them are GPL. I didn't look at the OpenOffice files at all.
Mindtraveller: Yes, as I mentioned before, the quality varies. Most of the small ones (with the exception of Russian and Ukrainian) are not widely used, so there was probably not much effort put into building the dictionaries. If anyone can supply a better wordlist, we can substitute it.
Also, the Hungarian file looked suspicious from the other side: it is too big, there were too many words. To compare: 12M words in Hungarian against 4M for Czech. Most of them looked like concatenations of two words with common prefixes, but not even Google could find any of these suspicious words... Anyone capable of checking them?
Mr. Ped: I was thinking about a similar approach: putting them on sf.net under the GPL license and writing a download wizard into theide. As you said, the problem is with end users who want to redistribute them with non-GPL apps. I guess they would have to reuse the wizard or supply their own dictionaries.
Mirek: Yes, it worked without problems. Also, since I got the wordlists directly from the aspell program, everything was in UTF-8 and I didn't have to bother with charset conversion. The entire process went smoothly and fully automatically. The only exception was the Hungarian one: it choked my computer, 512 MB of RAM was not enough, so I had to convert it on a different computer. But that is not a fault of the algorithm, just of my hardware.
Also one more thing: I'm not sure about some of the language codes, because in aspell they were designated with only a two-letter code, like "cs". For the sake of automation, I expanded them in such cases, presuming the second part is the same as the first, but in some cases that is wrong, e.g. cs-cs should be cs-cz... This should be checked/solved before making the files available for download. The sketch below shows the heuristic and where it goes wrong.
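Just to make the heuristic concrete, a minimal sketch; ExpandLangCode is a hypothetical helper I made up for illustration, not part of theide:
#include <Core/Core.h>

using namespace Upp;

// Naive expansion of a bare two-letter aspell code: "cs" becomes "cs-cs".
// This is exactly where the bug lies, because the correct country part
// may differ from the language part (cs-cz, en-us, ...).
String ExpandLangCode(const String& code)
{
    return code.GetCount() == 2 ? code + "-" + code : code;
}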
Best regards,
Honza
Re: Spell checking on linux [message #26436 is a reply to message #26322]
Sat, 01 May 2010 22:05
Very interesting... The dictionaries with small file sizes are actually not as empty as you assumed. I checked the word counts:
407752 wordlist.el
339747 wordlist.fa
455264 wordlist.he
14268 wordlist.ku
12497 wordlist.nr
6234 wordlist.ns
1029 wordlist.or
2045 wordlist.pa
732571 wordlist.ru
181779 wordlist.uk
For comparison:
417350 wordlist.bg
4669281 wordlist.cs
307891 wordlist.da
135275 wordlist.en_US
629569 wordlist.fr_FR
12939123 wordlist.hu
48490 wordlist.pt_PT
859141 wordlist.sk_SK
After looking at those numbers for a while, I got the idea that some of the wordlists may contain multiple entries for a single word. So I ran some of them through sort -u and here is what I got:
135275 wordlist.en_US
135275 wordlist.en_US.sorted
732571 wordlist.ru
1236 wordlist.ru.sorted
4669281 wordlist.cs
4269350 wordlist.cs.sorted
Conclusion:
The Russian dictionary is very poor and the same might be true for the other ones on the "suspicious list". So someone will have to do some quality control... I can check the word counts, but it would be nice if someone who actually speaks the given language checked it after me. (A small U++ sketch of the count check follows.)
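For the record, the same total-vs-unique check can be done in U++ itself; a minimal sketch, with the wordlist path hard-coded just for illustration:
#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
    FileIn in("wordlist.ru"); // hypothetical path to one of the lists
    Index<String> unique;     // keeps each word only once
    int total = 0;
    while(!in.IsEof()) {
        String w = in.GetLine();
        if(!w.IsEmpty()) {
            total++;
            unique.FindAdd(w);
        }
    }
    Cout() << total << " words, " << unique.GetCount() << " unique\n";
}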
Regards,
Honza
Re: Spell checking on linux [message #26614 is a reply to message #26322]
Sun, 16 May 2010 00:17
Hi,
Here are the dictionaries, together with the original sources. Can someone put them on sf.net? Or I can do it myself, if someone grants me the rights (my sf.net account is dolik_rce).
Also, I propose a little change to the function that finds the files:
Speller *sGetSpeller(int lang)
{
    static ArrayMap<int, Speller> speller;
    int q = speller.Find(lang);
    if(q < 0) {
        String pp;
        String dir = ConfigFile("scd");
        for(;;) {
            pp << dir << ';';
            String d = GetFileFolder(dir);
            if(d == dir) break;
            dir = d;
        }
        pp << spell_path << ';' << getenv("LIB") << ';' << getenv("PATH") << ';';
        String path = GetFileOnPath(ToLower(LNGAsText(lang)) + ".udc", pp);
        if(IsNull(path))
            path = GetFileOnPath(ToLower(LNGAsText(lang)) + ".scd", pp);
        if(IsNull(path)) // This is
            path = GetFileOnPath(ToLower(LNGAsText(lang).Left(2)) + ".udc", pp); // added
        if(IsNull(path))
            return false;
        FileIn in(path);
        if(!in)
            return false;
...
It will allow the dictionaries to have just the language part of the name (i.e. no country code). This way, even if the user selects a country which does not have a specific dictionary yet (e.g. EN-NZ), they still get a spell checker for the given language (that is, en.udc for the New Zealand example). IMHO it is better to serve a more general dictionary than nothing.
Best regards,
Honza
Re: Spell checking on linux [message #26615 is a reply to message #26614]
Sun, 16 May 2010 08:55
mirek
dolik.rce wrote on Sat, 15 May 2010 18:17 | Here are the dictionaries, together with the original sources. Can someone put them on sf.net? Or I can do it myself, if someone grants me the rights (my sf.net account is dolik_rce). |
Rights granted.
Quote: | Also, I propose a little change to the function that finds the files... It will allow the dictionaries to have just the language part of the name (i.e. no country code). |
OK, but do we really have those "more general" dictionaries?
Maybe we should scan for something like "en*.udc" instead?
Mirek
Re: Spell checking on linux [message #26616 is a reply to message #26615]
Sun, 16 May 2010 09:33
luzr wrote on Sun, 16 May 2010 08:55 | OK, but do we really have those "more general" dictionaries?
Maybe we should scan for something like "en*.udc" instead? |
Hi Mirek,
Thanks for the rights, I'll upload them today.
This idea is based on the aspell dictionaries. Most of them have only the general form. Some, like en, are to the best of my knowledge a combination of all words from all English dialects. There is only a single case where the general dictionary does not exist (pt-pt and pt-br, but no pt).
Maybe we could scan for en*.udc as a last resort, if the general dictionary does not exist. But if someone asks for Angolan Portuguese, how would you decide whether it is better to use pt-pt or pt-br? (See the sketch below for why the choice ends up arbitrary.)
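A minimal sketch of such a wildcard fallback, assuming U++'s FindFile; FindDictionary is a hypothetical helper, not the actual sGetSpeller() code:
#include <Core/Core.h>

using namespace Upp;

// Return the first dictionary whose basename starts with the two-letter
// language code, e.g. "en*.udc" matching en-gb.udc or en-us.udc.
String FindDictionary(int lang, const String& dir)
{
    String prefix = ToLower(LNGAsText(lang).Left(2));
    FindFile ff(AppendFileName(dir, prefix + "*.udc"));
    while(ff) {
        if(ff.IsFile())
            return ff.GetPath();
        ff.Next();
    }
    return Null;
}
Note that it simply takes whichever match the filesystem enumerates first, so for pt it would pick pt-pt or pt-br arbitrarily, which is exactly the problem above.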
Honza