Single Transliteration Scheme for all CM Languages - Part 2
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
mahakavi,
I am not sure, but there are a few variables:
1. The email program you used to send it. It should be able to send the content as Unicode. If it supports sending in HTML, I would presume it does.
2. The email program she used to read it. Again, it should be able to read the content AND apply the correct font for the tamizh Unicode text (browsers nowadays do this automatically). Of course, this also depends on whether a font that supports tamizh Unicode is installed (nowadays this is not much of a problem).
If you send it from a webmail account in HTML to another webmail account, I would have expected it to work. Let me do a bit of testing and see what's going on.
Arun
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
mahakavi,
I sent mail from my Yahoo account to myself (i.e. Yahoo) and also to another account for which Outlook is the reader.
If I send mail while my compose settings say "send as plain text", the mail on receipt does appear in tamizh, but words in a single line are split across multiple lines.
After I changed the compose settings to "send with colors and graphics" (i.e. HTML/rich text), it appeared fine on receipt (both Yahoo and Outlook).
Find out what email program your friend uses, and/or send it to a webmail account or an account that uses Outlook as the reader.
Arun
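One way to take the compose settings out of the picture is to build the message programmatically with both a plain-text and an HTML part, each declared as UTF-8, so the reader can pick whichever it handles. A minimal sketch using Python's standard `email` package follows. The function name and addresses are placeholders, and it says nothing about how Yahoo or Outlook actually build or display their messages:

```python
from email.message import EmailMessage
from html import escape

def build_tamil_message(sender, recipient, subject, text):
    """Build a multipart/alternative message carrying the same UTF-8 text
    as both a plain-text part and an HTML part."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Plain-text part, explicitly declared as charset utf-8.
    msg.set_content(text, charset="utf-8")
    # HTML alternative; the text is escaped and the charset declared again.
    body = (
        '<html><head><meta charset="utf-8"></head>'
        "<body><p>{}</p></body></html>".format(escape(text))
    )
    msg.add_alternative(body, subtype="html", charset="utf-8")
    return msg

msg = build_tamil_message(
    "sender@example.com", "reader@example.com", "tamizh test", "தமிழ்"
)
for part in msg.iter_parts():
    # Show the content type and declared charset of each part.
    print(part.get_content_type(), part.get_content_charset())
```

Whether a given reader then shows the plain or the HTML part is still up to that reader.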
Last edited by arunk on 01 Feb 2007, 20:46, edited 1 time in total.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
I spoke too soon.
Send as plain text: looks correct on Outlook, but wrong on Yahoo (in tamizh, but words in a single line are split across multiple lines).
Send as HTML: looks correct on Yahoo, but wrong on Outlook (not even in tamizh; the characters show up as HTML Unicode entities, e.g. &#2949; for tamizh "a").
So depending on which account you send to, you need to send it differently. What a wacky world!
Arun
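The "numbers" seen on the receiving side are HTML numeric character references. A minimal sketch in Python of how such references map to and from the tamizh text (the helper names are made up for illustration, and this only shows the encoding, not what a given mail reader decides to display):

```python
import html

def to_entities(text):
    """Encode every non-ASCII character as a decimal numeric character reference."""
    return "".join(c if ord(c) < 128 else "&#{};".format(ord(c)) for c in text)

def from_entities(text):
    """Decode character references back into the characters they stand for."""
    return html.unescape(text)

encoded = to_entities("\u0b85")   # tamizh letter "a", code point U+0B85
print(encoded)                    # the reference form a reader may show raw
print(from_entities(encoded))     # the letter itself
```

A reader that displays the first form has received the right characters but is not interpreting the message body as HTML.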
-
- Posts: 13754
- Joined: 02 Feb 2010, 22:26
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
arunk:
When I send the Thamizh text from Roadrunner to Roadrunner (myself), where I use Outlook Express, the message reads fine. But from Hotmail to Hotmail or Roadrunner, it is all a lot of numbers. From Hotmail to Yahoo the fidelity of the text is preserved but, as you said, the text gets split into numerous lines.
When I send from Roadrunner to Yahoo or Hotmail addresses (using Unicode format), it is again mumbo-jumbo, different from the previous gibberish numbers.
Well, I'll leave it there. Don't bother to resolve it unless you get to resolve it by sheer luck.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
divakar - pl. see http://arunk.freepgs.com/cmtranslit (and also the threads under the Languages section here on the forum).
ravi - If Hindi uses candrabindu for pa#nkaja (pa~nca) but Sanskrit uses anuswara, then the only way would be to support Hindi as a separate language. This is of course possible, but strictly speaking it would throw a wrinkle in the nomenclature of the scheme. But there are some issues. Does Hindi use candrabindu for words like mAm (i.e. words ending in "m")? Another wrinkle is that words like yU#n may be quite difficult to represent in other languages unless we use qualifiers. The trouble is that in non-tamizh scripts, successive consonants can be combined into single glyphs, and that makes qualifiers harder, at least in the Unicode representation.
Also, for words like yU#n, is the ending sound really supposed to represent the character #n here?
Sorry if those questions don't make sense. Yet again I am coming across a language which I don't know well.
Arun
Last edited by arunk on 02 Feb 2007, 22:13, edited 1 time in total.
-
- Posts: 13754
- Joined: 02 Feb 2010, 22:26
Arun, for the most part Hindi and Sanskrit use the same style for forming words... it is the urdU words that may make Hindi different... for instance, when Om is written in Sanskrit, a chandrabindu is used... that is what I mean. For pa~nca, pa#nkaja, a bindu will suffice, but for Ankh or yUn, a candrabindu is needed. I am not a very 'rules'-oriented speller, but instinctively get Hindi spelt correctly (don't ask me why or how, because in all other languages I have learnt to ignore my instincts to get the spelling right!)...
-
- Posts: 4066
- Joined: 26 Mar 2005, 17:01
Ravi
Sanskrit and Hindi do not follow the same pattern of writing their words. A bindu will not suffice in Sanskrit for words like pa#nkaja, pa~nca. The letters have to be clearly shown in the conjunct.
Hindi is a lot like kannaDa, Telugu and many others in that respect. The spellings are simplified.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
drshrikaanth wrote: A bindu will not suffice in Sanskrit for words like pa#nkaja, pa~nca. The letters have to be clearly shown in the conjunct.
Then at least one book I have doesn't follow this convention, as it uses the bindu. Maybe it is following Hindi rules, or perhaps it is a convention that, while not kosher, is OK to many people (my wife, who has learned Sanskrit, didn't seem bothered by it).
It is the book on Syama Sastry krithis by Smt. Vidya Sankar (has Devanagari, Telugu, Tamil and English). I can check other books.
A possibility is to make it an option.
Arun
-
- Posts: 4066
- Joined: 26 Mar 2005, 17:01
-
- Posts: 4066
- Joined: 26 Mar 2005, 17:01
See post 127 and around it. The same logic holds here as well. There may have been another discussion about this as well. You may be able to find it.
http://rasikas.org/forums/viewtopic.php?pid=27698#p27698
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
It's possible we discussed this, but the case of "M" at the end is what I remembered, and it is implemented (i.e. without anuswara for Sanskrit).
I did the anuswara for panca etc. based on that book I was talking about. But I also vaguely remember seeing other sources like http://carnatica.net/lyrics/ooth9.pdf, where anuswara is used at the end (!) but not in nca/cha etc. (2nd krithi). There doesn't seem to be any consistency - at least that's what I thought.
Once I put in the Sanskrit logic, I asked people several times to point out any errors so that I could fix the logic. I didn't hear a peep. Perhaps they assumed I wasn't listening or was incapable of listening.
Arun
-
- Posts: 4066
- Joined: 26 Mar 2005, 17:01
arunk wrote: But I also vaguely remember seeing other sources like http://carnatica.net/lyrics/ooth9.pdf, where anuswara is used at the end (!) but not in nca/cha etc. (2nd krithi). There doesn't seem to be any consistency - at least that's what I thought.
You still have doubts about "m" occurring at the end! :rolleyes:
arunk wrote: Once I put in the Sanskrit logic, I asked people several times to point out any errors so that I could fix the logic. I didn't hear a peep.
Wish I had all the time in the world (and no job) to answer your queries.
It's the same logic in the middle as well. "Nearly" always show the vyanjanas explicitly, even when the conjunct has a nasal consonant as the 1st half. "Nearly" because there are some exceptions like samyukta, where "sam" is a prefix to an otherwise independent word (yukta in this case). This means samga will not be saMga. saMsarga, saMyukta, saMtOSha, saMgIta, saMgAna etc. yes, but not saMga, saMkaTa, etc. Note here that tOSha, yukta, gIta and gAna are independent words with a saM prefix, but not ga, kaTa.
Likewise, words with the "kAra" suffix like ahaMkAra, jhaMkAra will feature the bindu only, not the consonant itself. I am not sure if there are exceptions to this. My thinking tells me other "suffixes" like cAra will also behave similarly. Basically, if they are one unit and form an integral part of the word to make sense, use the consonant. If added as a suffix or prefix, use anuswAra/bindu.
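The rule of thumb above can be sketched in Python as follows. This is only an illustration: the function name is made up, the stem list is a stand-in for a real dictionary that the caller would have to supply, and only the "saM" prefix and the "kAra" suffix mentioned in the post are handled:

```python
def anusvara_expected(word, independent_stems, suffixes=("kAra",)):
    """Return True if the nasal in `word` would be written as anusvara (M)
    under the prefix/suffix rule of thumb.

    `independent_stems` is a caller-supplied set of words that can stand on
    their own after the prefix "saM"; it stands in for a real dictionary.
    """
    prefix = "saM"
    # Prefix case: saM + an otherwise independent word.
    if word.startswith(prefix) and word[len(prefix):] in independent_stems:
        return True
    # Suffix case: ...M + a known suffix such as kAra.
    return any(word.endswith("M" + suffix) for suffix in suffixes)

stems = {"sarga", "yukta", "tOSha", "gIta", "gAna"}
for w in ["saMgIta", "saMyukta", "saMga", "saMkaTa", "ahaMkAra", "jhaMkAra"]:
    print(w, anusvara_expected(w, stems))
```

The sketch makes the dependency visible: the answer for any saM word is only as good as the list of independent stems behind it.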
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
drshrikaanth wrote: You still have doubts about "m" occurring at the end! :rolleyes:
I guess I do now. Doesn't that pdf file use anuswara at the end (e.g. santatam aham)? My point is that whatever the correct rules are, in practice (I am guessing owing to Hindi's popularity) there are variations (?)
Please also check my other post in the languages thread in response to the rules you mention.
Arun
-
- Posts: 4066
- Joined: 26 Mar 2005, 17:01
Arun
We have discussed at length the use of m/M at the end. We also discussed the reasons for variations - not necessarily Hindi's popularity, but the influence of spelling in one's mother tongue. If you still have doubts, I am not responsible for it. I don't have doubts in this matter at least.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
DRS said the following in another thread regarding rules as to when anuswara appears in Sanskrit in the middle of words:
drshrikaanth wrote: It's the same logic in the middle as well. "Nearly" always show the vyanjanas explicitly, even when the conjunct has a nasal consonant as the 1st half. "Nearly" because there are some exceptions like samyukta, where "sam" is a prefix to an otherwise independent word (yukta in this case). This means samga will not be saMga. saMsarga, saMyukta, saMtOSha, saMgIta, saMgAna etc. yes, but not saMga, saMkaTa, etc. Note here that tOSha, yukta, gIta and gAna are independent words with a saM prefix, but not ga, kaTa. Likewise, words with the "kAra" suffix like ahaMkAra, jhaMkAra will feature the bindu only, not the consonant itself. I am not sure if there are exceptions to this. My thinking tells me other "suffixes" like cAra will also behave similarly. Basically, if they are one unit and form an integral part of the word to make sense, use the consonant. If added as a suffix or prefix, use anuswAra/bindu.
Unless I am mistaken, things have gotten a bit complicated now.
What this tells me is that for my logic, it would be best if I force anuswara for Sanskrit only in the middle, and only if explicitly specified as M, and let people specify it judiciously (i.e. it would be too difficult for the logic to know which is one unit vs. suffix etc.).
But for languages like Kannada and Telugu, when preceding k(h)a, g(h)a, c(h)a, j(h)a (and others), the anuswara always figures, right? So this would mean that M in the middle should be used judiciously even when entering text for other languages - it should be used ONLY if it is an anuswara in Sanskrit, otherwise the Sanskrit rendition would be screwed up. This is certainly a big wrench, since a person entering Telugu or Kannada, and even worse Tamil, may have no idea about these rules in Sanskrit.
This also means that for such words the "phonetically better variant in English" would be wrong and cannot be used (i.e. never sangIta, always saMgIta).
This allows me to ask a question which I have had ever since I was exposed to it: what is the purpose behind the anuswara? It seems to represent some other sound for which a character does exist in the script. Why then not use the character itself?
(Or maybe I should retire to a "less than perfect" Sanskrit rendition - i.e. always use anuswara or never use anuswara.)
Arun
Last edited by arunk on 03 Feb 2007, 03:05, edited 1 time in total.
-
- Posts: 1876
- Joined: 04 Feb 2010, 02:05
arun - I have not experimented with the export feature yet.
I found one problem with the variables. Or I may not have understood how to use them.
1. If I type caraNam, then the kannaDa transliteration should show it as caraNa. Right? But that is not happening. It shows up as caraNam, with a bindu at the end.
2. The keyword is sometimes not recognized as a variable at all, even though the spelling is correct.
-Ramakriya
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
Variables are experimental.
#1: It shows up as caraNam because I seem to have (incorrectly) defined it as such. I need some help in knowing this (for all). I know drs gave the Kannada equivalents; I need to go and incorporate them.
#2: Even when you click on a word and hit the "$" button? If so, can you give an example? If it is on "convert all" (i.e. the 3 arrows pointing to the $ button), then it is on purpose. I didn't want to mistakenly convert words in the sAhitya portion, and thus am extra careful in looking for certain patterns.
Arun
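A per-language table for such keywords could look roughly like the Python sketch below. The table layout, the language key and the function name are hypothetical and not taken from the editor; the only entry that comes from this thread is the kannaDa form of caraNam:

```python
# Hypothetical per-language overrides for section keywords ("variables").
# Only the kannaDa form of caraNam is taken from the discussion above.
VARIABLES = {
    "caraNam": {"kannada": "caraNa"},
}

def resolve_variable(keyword, language):
    """Return the language-specific form of a keyword, or the keyword itself
    when no override is defined for that language."""
    return VARIABLES.get(keyword, {}).get(language, keyword)

print(resolve_variable("caraNam", "kannada"))
print(resolve_variable("caraNam", "tamil"))
```

Falling back to the typed keyword keeps the output usable while the equivalents for the other languages are still being collected.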
Last edited by arunk on 03 Feb 2007, 03:00, edited 1 time in total.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
Unless my assumptions/conclusions are yet again wrong, I am thinking of doing the following:
1. Change the word "Sanskrit" to "Devanagari" as it appears in the editor. This is mainly to indicate that the generated script may not be considered proper Sanskrit, as not all written rules are followed.
2. Have 2 anuswara options for Devanagari:
(i) Always generate (so more like Hindi)
(ii) Never generate (closer to Sanskrit but not that close => words like sangIta would be all messed up).
I don't know if this salvages the situation enough. Also I don't know if option 2(ii) is that useful, as it would be a mixed bag (neither Hindi-like nor Sanskrit-like).
Suggestions?
Arun
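At the Unicode level, the two options amount to swapping between the anusvara sign (U+0902) and an explicit class nasal followed by virama (U+094D). A minimal Python sketch, assuming only the five class nasals before stops of their own class need handling; the function names are made up and this is not the editor's actual logic:

```python
import re

VIRAMA = "\u094d"
ANUSVARA = "\u0902"

# Class nasal -> the stops of its own class (varga).
VARGAS = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # nga : ka kha ga gha
    "\u091e": "\u091a\u091b\u091c\u091d",  # nya : ca cha ja jha
    "\u0923": "\u091f\u0920\u0921\u0922",  # Na  : Ta Tha Da Dha
    "\u0928": "\u0924\u0925\u0926\u0927",  # na  : ta tha da dha
    "\u092e": "\u092a\u092b\u092c\u092d",  # ma  : pa pha ba bha
}
NASAL_FOR = {stop: nasal for nasal, stops in VARGAS.items() for stop in stops}

def always_anusvara(text):
    """Option (i): class nasal + virama + stop of the same class -> anusvara + stop."""
    def repl(m):
        nasal, stop = m.group(1), m.group(2)
        return ANUSVARA + stop if NASAL_FOR.get(stop) == nasal else m.group(0)
    return re.sub("([" + "".join(VARGAS) + "])" + VIRAMA + "(.)", repl, text)

def never_anusvara(text):
    """Option (ii): anusvara + stop -> class nasal + virama + stop."""
    def repl(m):
        stop = m.group(1)
        return NASAL_FOR[stop] + VIRAMA + stop
    return re.sub(ANUSVARA + "([" + "".join(NASAL_FOR) + "])", repl, text)

print(always_anusvara("\u092a\u0919\u094d\u0915\u091c"))  # pa#nkaja written with the conjunct
print(never_anusvara("\u0938\u0902\u0917\u0940\u0924"))   # saMgIta written with anusvara
```

The second call shows the drawback noted under option (ii): a word that is supposed to keep its anusvara gets a conjunct instead.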
Last edited by arunk on 03 Feb 2007, 03:30, edited 1 time in total.
-
- Posts: 1317
- Joined: 30 Jun 2006, 03:08
Arun - the bindu at the end is what I know, based on my Sanskrit classes in school and college. Usage of M seems to be a variation, sometimes for aesthetics. If you read through kriti no. 2 (vAnchasi yadi) in the pdf file, you will find occurrences of both m and M for the word kuSalam/kuSalaM. To make it simple for yourself, I would suggest you go with the bindu version.
Btw, the way they have written rAgaM and tAlaM is jarring, at least to my eyes!
Also, 'ambika' (as in kamalAmbika) is not written with bindu, the half-consonant is used. At least this is the way I have read and written all these years.
-
- Posts: 1317
- Joined: 30 Jun 2006, 03:08
And DRS is correct in saying that one's mother tongue has an influence on how these are written in Sanskrit. Coming from a Kerala background, I was taught to use the half-consonants instead of the bindu in most cases (within words). Malayalam follows similar rules.
The Namboodiris of Kerala are reputed to have the 'most authentic' knowledge of Sanskrit, so obviously I had assumed we were taught the most accurate version!
(finally, perhaps we should move this language discussion to where it belongs - arun's thread!
Let OP-ji rest in peace!)
Last edited by jayaram on 03 Feb 2007, 03:54, edited 1 time in total.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
I found this link, which talks about anuswaras in the context of sandhi rules:
http://www.sanskrit-sanscrito.com.ar/en ... rules.html. It talks about when "m" at the end of a word becomes anuswara and when it does not. Basically, it becomes anuswara if it is followed by a word that begins with a consonant.
This seems to be followed here: http://sanskrit.safire.com/pdf/DURGA700.pdf, where you have cases where "m" at the end is rendered as a consonant, and also cases where it is rendered as anuswara. You see it as an anuswara at the end of a "line"/"sentence" (so no word to follow and hence no consonant to follow), i.e. before a | or ||. For example, the title itself, the first line on the right side, and also several other places. You also see the bindu used within a "line"/"sentence". The cases of a bindu inside words are much, much rarer (but they are there: on page 6 - "saMhati..."(?), also on page 12 - saMyugE (?)), and that is of course what drs said.
Of course I don't know how official/authentic these are, but at least I wanted to see some reasoning behind the "mixture of bindu and no-bindu cases" - and I see it now.
Now the rule for the end of a word within a sentence and followed by a consonant is something that is possible to program.
The trouble is that when the bindu occurs in the middle depends on the interpretation of words etc., and is not possible to program without an elaborate setup with lookups to a dictionary and such.
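The end-of-word rule described above can be sketched in Python as follows. The vowel set and the function name are assumptions about the scheme, not taken from the editor, and the sketch deliberately leaves a line-final "m" (before | or ||) untouched even though the PDF above appears to use anuswara there:

```python
VOWELS = set("aAiIuUeEoO")  # assumed vowel letters of the scheme

def apply_final_m_rule(line):
    """Rewrite a word-final "m" as anusvara "M" when the next word in the
    same line begins with a consonant; leave it alone otherwise."""
    words = line.split()
    out = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else ""
        # "#" and "~" start the qualified nasals #n and ~n in the scheme.
        starts_with_consonant = (
            bool(nxt)
            and (nxt[0].isalpha() or nxt[0] in "#~")
            and nxt[0] not in VOWELS
        )
        if word.endswith("m") and starts_with_consonant:
            word = word[:-1] + "M"
        out.append(word)
    return " ".join(out)

print(apply_final_m_rule("santatam aham bhajAmi"))
```

Here "santatam" is followed by a vowel and keeps its "m", while "aham" is followed by a consonant and gets "M".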
So I think we are still down to either
(a) use it like Telugu, Kannada and Hindi (i.e. always use it),
(b) not use it, or
(c) use it only at the end (i.e. following the end-of-word rule above) but never in the middle.
Of course none of them is correct for Sanskrit, but I am guessing/hoping that
(a) would be OK for people to read (as they may apply their native language rules).
(c) looks closer to Sanskrit and ma....y be passable, although it will definitely mess up the words that drs mentioned.
If (c) is done at all, (b) is useless.
Can people pl. chime in and give me advice on whether (a) is OK, and whether I should even bother with (c)?
Thanks
Arun
Last edited by arunk on 03 Feb 2007, 05:11, edited 1 time in total.
-
- Posts: 1317
- Joined: 30 Jun 2006, 03:08
Arun - I get the feeling that if you go with option (a) for Devanagari, we may do the same for Malayalam! And it does look a bit weird if this option is used in Malayalam, at least for old-timers like myself.
My own take on this:
1. OK to use bindu across the board for the endings. As I said earlier, the M ending is for aesthetics; I don't believe there's a rigid rule for this.
2. Use the half-consonant within a word using the appropriate rules - tough to implement, I agree, but at least this can be done for certain often-occurring words. Perhaps you could look through Dikshitar kritis for words such as 'ambika': http://www.rogepost.com/n/4405894335
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
Yes jayaram, it would be less than ideal for Malayalam - that is not good either.
I will try the more difficult approach. For Sanskrit (and Malayalam too?), as drs indicated, the # of cases which DON'T employ the bindu in the middle of the word outnumbers the cases which do. So I could build up a database of known words that do employ the bindu and use smart matching. So by default no bindu, except for these known words. This will handle amba etc. correctly by default. It will also handle sangIta, santOsha (assuming they are in the database).
On top of that, it may be possible to introduce a feature in the editor (not the scheme) to force the use of bindu in Sanskrit/Malayalam for a specific word. So with a combination of this and the database of known words, we may be able to get things right. However, unless the database of known words is good (so that it takes care of almost all common occurrences in kritis), it would be a pain for the user to have to spoon-feed the editor.
I will look into this.
Thanks
Arun
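The "known words plus forced override" idea could be sketched in Python along these lines. Everything here is hypothetical: the seed list, the function names, and the set of nasal spellings treated as equivalent are assumptions for illustration, and "smart matching" is reduced to normalising any pre-consonant nasal to "M" before the lookup:

```python
import re

# Hypothetical seed list; a real list would come from a dictionary.
KNOWN_BINDU_WORDS = {"saMgIta", "saMtOSha", "ahaMkAra"}

# Nasal spellings that may precede a consonant in user input (assumed).
_NASAL = re.compile(r"(#n|~n|M|N|n|m)(?=[kgcjTDtdpbyrlvSsh])")

def bindu_key(word):
    """Normalise every pre-consonant nasal to "M" so spelling variants
    such as sangIta and saMgIta compare equal."""
    return _NASAL.sub("M", word)

def wants_bindu(word, forced=()):
    """True if the word should get a bindu in the middle: either it is in
    the known-word list or the user forced it for this word."""
    key = bindu_key(word)
    return key in KNOWN_BINDU_WORDS or key in {bindu_key(w) for w in forced}

for w in ["sangIta", "santOSha", "amba"]:
    print(w, wants_bindu(w))
print("amba (forced)", wants_bindu("amba", forced=["amba"]))
```

With this shape, amba falls through to the default (no bindu) unless the user forces it, and sangIta is caught regardless of how the nasal was typed.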
Last edited by arunk on 03 Feb 2007, 20:10, edited 1 time in total.
-
- Posts: 4066
- Joined: 26 Mar 2005, 17:01
arunk wrote: For Sanskrit (and Malayalam too?), as drs indicated, the # of cases which DON'T employ the bindu in the middle of the word outnumbers the cases which do. So I could build up a database of known words that do employ the bindu and use smart matching.
Forget about doing this Arun, as the list of words will stretch to several thousands! I just checked. The way out would be to link up with a pre-existing online dictionary and match with that spelling.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
I did multiple searches on the cologne-sanskrit dictionary for occurrences of aM, eM, iM, uM, oM (I think their transl. scheme uses M only in the right places - pl. confirm). The search is case-insensitive, so it matches stuff we don't need. So some filtering was needed afterwards.
I saved the (massive) results on my local disk and did some (programmatic) filtering. Assuming I did it right, there are 3076 words in that dictionary which use M (in those contexts). The cumulative # of bytes for all these words is about 34K. That is not bad, actually, so loading it into memory with the editor is not ruled out.
Of course the scheme that the cologne-sanskrit dictionary uses is different, so some more "translation" to our scheme is needed (which can increase the # of chars). This is no big deal.
Drs - pl. let me know if it is OK for me to send you the results, to see if the list of matched words makes sense (i.e. whether I got a good representative list).
Arun
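The filtering and size estimate could be done along these lines in Python. This is a sketch, not the commands actually used: the vowel set is an assumption about the dictionary's transliteration, and the sample list is made up:

```python
import re

# Case-sensitive: a vowel letter immediately followed by capital "M".
VOWEL_M = re.compile(r"[aAiIuUeo]M")

def filter_m_words(words):
    """Keep only the words that contain vowel + "M" with exact case,
    dropping the extra hits a case-insensitive search lets through."""
    return [w for w in words if VOWEL_M.search(w)]

def total_bytes(words):
    """Size of the list if stored as UTF-8 with one word per line."""
    return sum(len(w.encode("utf-8")) + 1 for w in words)

hits = ["saMgIta", "samaya", "ahaMkAra", "amba", "saMyukta"]
kept = filter_m_words(hits)
print(len(kept), total_bytes(kept))
```

Running the same two functions over the saved search results would give the word count and the in-memory footprint in one pass.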
Last edited by arunk on 03 Feb 2007, 22:19, edited 1 time in total.
-
- Posts: 4066
- Joined: 26 Mar 2005, 17:01
arunk wrote: I did multiple searches on the cologne-sanskrit dictionary for occurrences of aM, eM, iM, uM, oM (I think their transl. scheme uses M only in the right places - pl. confirm). The search is case-insensitive, so it matches stuff we don't need. So some filtering was needed afterwards.
I searched on Cologne too, but used a different combination. Your combinations like aM, eM will come up with what we don't need as well, since as you have rightly pointed out the search is case-insensitive. But use these combinations: Mk, Mkh, Mg, Mgh etc. You can't go wrong here. It is only in the (p, ph, b, bh, m) pentad that you will have problems. Also some overlap in (y, r, l). Otherwise we are fine.
arunk wrote: I saved the (massive) results on my local disk and did some (programmatic) filtering. Assuming I did it right, there are 3076 words in that dictionary which use M (in those contexts).
There will easily be more than 10,000 words. More towards 20K, I estimate.
arunk wrote: Of course the scheme that the cologne-sanskrit dictionary uses is different, so some more "translation" to our scheme is needed (which can increase the # of chars). This is no big deal.
The transliteration scheme used there is the H-K (Harvard-Kyoto) convention. I had earlier in a post given a step-by-step procedure to convert H-K to our scheme. I think in this thread itself. Check that.
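One way to work with the "M + consonant" combinations suggested above is to bucket the matches by the consonant that follows M, so the groups flagged as troublesome can be reviewed separately. A Python sketch under assumptions: the consonant inventory is my reading of Harvard-Kyoto, and the function names and sample words are made up:

```python
import re
from collections import defaultdict

# A Harvard-Kyoto consonant (with optional aspiration) right after "M".
M_CLUSTER = re.compile(r"M([kgcjTDtdpb]h?|[GJNnmyrlvzSsh])")

# Groups flagged in the post above as needing a manual look.
REVIEW = {"p", "ph", "b", "bh", "m", "y", "r", "l"}

def bucket_by_following(words):
    """Group words by the consonant that follows "M" (case-sensitive)."""
    buckets = defaultdict(list)
    for w in words:
        for m in M_CLUSTER.finditer(w):
            buckets[m.group(1)].append(w)
    return dict(buckets)

def split_for_review(buckets):
    """Separate the straightforward buckets from the ones to review by hand."""
    safe = {k: v for k, v in buckets.items() if k not in REVIEW}
    review = {k: v for k, v in buckets.items() if k in REVIEW}
    return safe, review

words = ["saMgIta", "saMkhyA", "saMyukta", "saMpUrNa", "samaya"]
safe, review = split_for_review(bucket_by_following(words))
print(sorted(safe), sorted(review))
```

Because the pattern is case-sensitive, a plain lowercase "m" as in samaya never lands in any bucket.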
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
drshrikaanth wrote: I searched on Cologne too, but used a different combination. Your combinations like aM, eM will come up with what we don't need as well, since as you have rightly pointed out the search is case-insensitive. But use these combinations: Mk, Mkh, Mg, Mgh etc. You can't go wrong here.
Filtering out non-M was no big deal. There are several utilities on Unix-like systems (e.g. my Mac) that make this very easy.
drshrikaanth wrote: There will easily be more than 10,000 words. More towards 20K, I estimate.
I guess then I did something wrong in my steps. The total # of words (i.e. case-insensitive) was 51618. So it did match a lot. It still doesn't add up: either the dictionary does not include most of them, or my search criteria were wrong (it is quite difficult to screw up the filter step - a very simple command), or I didn't save all the results.
Arun
Last edited by arunk on 04 Feb 2007, 00:40, edited 1 time in total.
-
- Posts: 3424
- Joined: 07 Feb 2010, 21:41
After exchanging some emails with drs, we solved the "mystery" as to why my searches weren't getting all the words. Anyway, the entire list is about 7400 words, which I think is still manageable (but I need to confirm).
Arun
Last edited by arunk on 04 Feb 2007, 21:03, edited 1 time in total.