Single Transliteration Scheme for all CM Languages - Part 2

arunk · Post by **arunk** » 01 Feb 2007, 20:30

mahakavi,

I am not sure but there are a few variables:
1. The email program you used to send it. It should be able to send the content as "unicode". If it supports sending in HTML, i would presume it does
2. Which email program she used to read it. Again it should be able to read it, AND apply the correct font for the tamizh unicode text (browsers nowadays do this automatically). Of course this also depends on if a font that supports tamizh unicode is installed (nowadays this is not that much of a problem).

If you send it from a webmail account in HTML to another webmail account, i would have expected it to work. Let me do a bit of testing and see whats going on.

Arun

arunk · Post by **arunk** » 01 Feb 2007, 20:45

mahakavi,

i sent mail from my yahoo account to myself (i.e. yahoo) and also another account for which outlook is the reader.

If i send mail but my compose settings say "send as plain text", the mail on receipt does appear in tamizh but words in a single line are split across multiple lines.

After I fix the compose settings to say "send with colors and graphics" (i.e. html/rich-text), it appeared fine on receipt (both yahoo and outlook).

Find out what email program your friend uses and/or send it to a webmail account or an account that uses Outlook as reader.

Arun

arunk · Post by **arunk** » 01 Feb 2007, 20:49

i spoke too soon

Send as plain text: Looks correct on Outlook, but wrong (in tamizh but words in a single line are split across multiple lines) on Yahoo.
Send as HTML: Looks correct on Yahoo, but wrong on Outlook (not even in tamizh; the characters show up as HTML unicode entities, e.g. &#2949; for tamizh "a").

So depending on which account you send to, you need to send it differently. What a wacky world!
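The "HTML unicode entities" described here are numeric character references: the mail body carries the decimal code point of each character, and a reader that does not interpret the HTML shows the raw reference. A minimal Python sketch of that round trip, using only the standard library (the variable names are illustrative, not part of any mail program):

```python
import html

tamil_a = "\u0b85"  # TAMIL LETTER A

# Build the numeric character reference by hand from the code point.
entity = f"&#{ord(tamil_a)};"

# The same reference, produced by Python's built-in error handler
# for characters that do not fit in ASCII.
as_entities = tamil_a.encode("ascii", "xmlcharrefreplace").decode("ascii")

# An HTML-aware reader turns the reference back into the character.
decoded = html.unescape(entity)

print(entity, as_entities, decoded)
```

A reader that displays the reference literally is showing the text without HTML decoding, which would be consistent with the Outlook behaviour reported above.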

Arun

rshankar · Post by **rshankar** » 01 Feb 2007, 22:12

Arun,
Can you help me to fix the bolded words here:
इक परदेसि मेरा दिल ले गया
जाते जाते मीठा मीठा गम दे गया

आप यूङ् ही अगर हमसे मिलते रहे
देखिये एक दिन प्यार हो जायेगा

आंखोन् से जो उतरी है दिल् मे
(क्या बात है उस परवाने मे)
खुद ढूङ्ढ रही है शम्मा जिसे
क्या बात है उस पर्वाने मे?

Thanks...

mahakavi · Post by **mahakavi** » 01 Feb 2007, 22:13

arunk:
Thanks.
What about hotmail?
I will try it myself soon.

arunk · Post by **arunk** » 01 Feb 2007, 22:17

ravi,

i am very poor (used to be a zero) in reading Devanagari. What is the transliteration text you entered?

Thanks
Arun

mahakavi · Post by **mahakavi** » 01 Feb 2007, 22:22

Is it "AnkhOne"?

rshankar · Post by **rshankar** » 01 Feb 2007, 22:25

yU#n - what I wanted to show up was yU with a chandrabindu on top...

A#nkhEn - Need a chandrabindu over A...I figured out the rest of the issues when I played around it with the scheme...

arunk · Post by **arunk** » 01 Feb 2007, 22:27

deleted

arunk · Post by **arunk** » 01 Feb 2007, 22:28

The candrabindu is that teeny weeny dot. Thats how the font renders it.

(unless i am wrong)

arunk · Post by **arunk** » 01 Feb 2007, 22:29

rshankar wrote:yU#n - what I wanted to show up was yU with a chandrabindu on top...

No support for this. Is this hindi specific or does sanskrit have it too?

Arun

rshankar · Post by **rshankar** » 01 Feb 2007, 22:30

candrabindu is a dot (the bindu part) inside a quarter circle (the candra part - would look like the parenthesis that ends this part rotated 90 degrees clockwise)...
Also, how do I get a bindu atop the last letter of a word?
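For reference, Unicode encodes the two marks under discussion as separate combining signs in the Devanagari block, so a transliterator has to choose which one to emit. A small Python sketch that prints their code points and builds the same word both ways (the choice of sample word is only for illustration):

```python
import unicodedata

CANDRABINDU = "\u0901"  # dot inside a crescent
ANUSVARA = "\u0902"     # plain dot (bindu)

for mark in (CANDRABINDU, ANUSVARA):
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}")

AA = "\u0906"   # DEVANAGARI LETTER AA
KHA = "\u0916"  # DEVANAGARI LETTER KHA

# Same base letters, different nasal sign after the first vowel.
with_anusvara = AA + ANUSVARA + KHA
with_candrabindu = AA + CANDRABINDU + KHA
print(with_anusvara, with_candrabindu)
```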

rshankar · Post by **rshankar** » 01 Feb 2007, 22:31

arunk wrote:
rshankar wrote:yU#n - what I wanted to show up was yU with a chandrabindu on top...
No support for this. Is this hindi specific or does sanskrit have it too?

Arun

I think it is Hindi specific...not too sure.

arunk · Post by **arunk** » 01 Feb 2007, 22:33

Got it. Now for sanskrit, when is candrabindu used vs when is anuswara (the dot) used?

I am generating anuswara here (in all cases), i am wondering if for sanskrit i should generate candrabindu always - or whether it is dependent on context.

Thanks
Arun

arunk · Post by **arunk** » 01 Feb 2007, 22:36

btw, i have seen at least one CM book in sanskrit where the anuswara is used for pa#nkaja etc.

Arun

ramakriya · Post by **ramakriya** » 01 Feb 2007, 23:22

arunk,

Seems to be a new bug now:

The word SR.ngAra (as the rasa) incorrectly transliterates (except in tamizh) as below

श्र्.ंगार శ్ర్.ంగార ಶ್ರ್.ಂಗಾರ ஸ்2ரு2ங்கார

-Ramakriya

arunk · Post by **arunk** » 01 Feb 2007, 23:31

i will check. Strangely sR.ng works fine. That just indicates that it is a bug in handling 'S'. It looks like it is getting confused with the SrI logic.

Arun

mahakavi · Post by **mahakavi** » 02 Feb 2007, 00:04

arunk:
When I send the Thamizh text from roadrunner to roadrunner(myself) email where I use Outlook express, the message reads fine. But hotmail to hotmail or roadrunner it is all a lot of numbers. Hotmail to Yahoo the fidelity of the text is preserved but as you said the text gets split into numerous lines.

When I send from roadrunner to yahoo, or hotmail addresses (using unicode format), it is again mumbo-jumbo stuff different from the previous gibberish numbers.

Well I'll leave it there. Don't bother to resolve it unless you get to resolve it by sheer luck.

arunk · Post by **arunk** » 02 Feb 2007, 00:14

ramakriya - i uploaded a fix for SR.ngAra.

mahakavi - there is not much we can do. It is not something unique to what I am generating, but is a problem with sending over unicode text. But i am surprised hotmail doesnt handle itself!

Arun

rshankar · Post by **rshankar** » 02 Feb 2007, 09:46

divakar wrote:rshankar: i wonder how you got the language script.

I used Arun's transliteration program.

arunk · Post by **arunk** » 02 Feb 2007, 22:12

divakar - pl. see http://arunk.freepgs.com/cmtranslit (and also threads under Languages section here on the forum).

ravi - If hindi uses candrabindu for pa#nkaja (pa~nca), but sanskrit uses anuswara, then the only way would be to support hindi as a separate language. This is of course possible but strictly speaking would throw a wrinkle in the nomenclature of the scheme. But there are some issues. Does hindi use candrabindu for words like mAm (i.e. words ending in "m")? Another wrinkle is that words like yU#n may be quite difficult to represent in other languages unless we use qualifiers. The trouble is that in non-tamizh scripts, successive consonants can be combined into single glyphs, and that makes qualifiers harder, at least in unicode representation.

Also, for words like yU#n is the ending sound really supposed to represent the character #n here?

Sorry if those questions dont make sense. Yet again coming across a language which i dont know well

Arun

rshankar · Post by **rshankar** » 02 Feb 2007, 23:43

Arun, for the most part Hindi and Sanskrit use the same style for forming words...it is the urdU words that may make Hindi different...for instance, when Om is written in sanskrit, a chandrabindu is used...that is what I mean. For pa~nca, pa#nkaja, a bindu will suffice, but for Ankh, or yUn, a candrabindu is needed. I am not a very 'rules' oriented speller, but instinctively get hindi spelt correct (don't ask me why or how, because in all other languages, I have learnt to ignore my instincts to get the spelling right!)...

drshrikaanth · Post by **drshrikaanth** » 03 Feb 2007, 00:08

Ravi
Sanskrit and hindi do not follow the same pattern of writing their words. A bindu will not suffice in sanskrit for words like pa#nkaja, pa~nca. The letters have to be clearly shown in the conjunct.

Hindi is a lot like kannaDa, telugu and many others in that respect. the spellings are simplified.

arunk · Post by **arunk** » 03 Feb 2007, 00:31

drshrikaanth wrote:A bindu will not suffice in sanskrit for words like pa#nkaja, pa~nca. The letters have to be clearly shown in the conjunct.

Then at least one book i have doesn't follow this convention, as it is using the bindu. Maybe it is following hindi rules, or perhaps a convention that, while not kosher, is ok to many people (my wife, who has learned sanskrit, didn't seem bothered by it).

It is the book on Syama Sastry krithis by Smt. Vidya Sankar (has Devanagari, telugu, tamil and english). I can check other books.

A possibility is to make it an option.

Arun

drshrikaanth · Post by **drshrikaanth** » 03 Feb 2007, 00:47

arunk
IIRC We have discussed this very same issue earlier in the previous thread on transliteration. Surely we dont need a recap?

drshrikaanth · Post by **drshrikaanth** » 03 Feb 2007, 01:07

See post 127 and around it. Same logic holds here as well. There may have been another discussion as well about this. You may be able to find it.

http://rasikas.org/forums/viewtopic.php?pid=27698#p27698

arunk · Post by **arunk** » 03 Feb 2007, 02:01

its possible we discussed this but the case of "M" at end is what I remembered and it is implemented (i.e. without anuswara for sanskrit).

I did the anuswara for panca etc. based on that book i was talking about. But i also vaguely remember seeing other sources like http://carnatica.net/lyrics/ooth9.pdf, where anuswara is used at the end (!) but not in nca/cha etc. (2nd krithi). There doesn't seem to be any consistency - at least that's what I thought.

Once I put sanskrit logic, I had asked people several times to point out any errors so that i can fix the logic after I put it up. I didnt hear a peep. Perhaps they assumed i wasnt listening or incapable of listening

Arun

arunk · Post by **arunk** » 03 Feb 2007, 02:26

never mind - i think it is easy to make it an option. The default would be no anuswara in the middle or at end, but people can change it if they want. The second would handle most of hindi except for the urdu influenced words.

Arun

drshrikaanth · Post by **drshrikaanth** » 03 Feb 2007, 02:27

arunk wrote:But i also vaguely remember seeing other sources like http://carnatica.net/lyrics/ooth9.pdf, where anuswara is used at the end (!) but not in nca/cha etc. (2nd krithi). There doesnt seem any consistency - atleast thats what I thought.

You still have doubts about "m" occurring at the end!:rolleyes:

Once I put sanskrit logic, I had asked people several times to point out any errors so that i can fix the logic after I put it up. I didnt hear a peep.

Wish I had all the time in the world (and no job) to answer your queries

It's the same logic in the middle as well. "Nearly" always show the vyanjanas explicitly, even when the conjunct has a nasal consonant as the 1st half. "Nearly" because there are some exceptions like samyukta, where "sam" is a prefix to an otherwise independent word (yukta in this case). This means samga will not be saMga. saMsarga, saMyukta, saMtOSha, saMgIta, saMgAna etc. yes, but not saMga, saMkaTa, etc. Note here that tOSha, yukta, gIta and gAna are independent words with a saM prefix, but not ga, kaTa.

Likewise words with "kAra" suffix like ahaMkAra, jhaMkAra will feature the bindu only, not the consonant itself. I am not sure if there are exceptions to this. My thinking tells me other "suffixes" like cAra will also behave similarly. Basically, if they are one unit and form an integral part of the word to make sense, use consonant. If added as suffix or prefix, use anuswAra/bindu.
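The prefix/suffix rule of thumb above can be sketched in Python as follows. This is only an illustration under stated assumptions: the word list and suffix list are hypothetical stand-ins taken from the examples in this post, and the function name is invented here. A real implementation would need a dictionary to know which remainders are independent words.

```python
# Hypothetical data: remainders named in the post as independent words.
INDEPENDENT_WORDS = {"yukta", "tOSha", "gIta", "gAna"}

# Hypothetical data: suffixes said to take the bindu ("kAra" is from the post).
BINDU_SUFFIXES = ("kAra",)


def keeps_anusvara(word: str) -> bool:
    """Return True if the M in `word` should stay an anuswara/bindu.

    Rule of thumb from the post: bindu when saM is a prefix to an
    independent word, or when M precedes a listed suffix; otherwise
    the nasal consonant is written explicitly in the conjunct.
    """
    if word.startswith("saM") and word[3:] in INDEPENDENT_WORDS:
        return True
    return any(word.endswith("M" + suffix) for suffix in BINDU_SUFFIXES)


for w in ("saMgIta", "saMyukta", "ahaMkAra", "saMga", "saMkaTa"):
    print(w, keeps_anusvara(w))
```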

arunk · Post by **arunk** » 03 Feb 2007, 02:27

ramakriya - did the export to dokuwiki feature help?

Thanks
Arun

arunk · Post by **arunk** » 03 Feb 2007, 02:45

drshrikaanth wrote:You still have doubts about "m" occurring at the end!:rolleyes:

i guess i do now. Doesn't that pdf file use anuswara at the end (e.g. santatam aham)? My point is whatever the correct rules are, in practice (i am guessing owing to hindi's popularity), there are variations (?)

Please also check my other post in languages thread in response to rules you mention

Arun

drshrikaanth · Post by **drshrikaanth** » 03 Feb 2007, 02:50

Arun
We have discussed at length the use of m/M at the end. We also discussed the reasons for variations: not necessarily hindi's popularity, but the influence of spelling in one's mother tongue. If you still have doubts, I am not responsible for it. I don't have doubts in this matter at least.

arunk · Post by **arunk** » 03 Feb 2007, 02:51

DRS said the following in another thread regarding rules as to when anuswara appears in sanskrit in the middle of words:

It's the same logic in the middle as well. "Nearly" always show the vyanjanas explicitly, even when the conjunct has a nasal consonant as the 1st half. "Nearly" because there are some exceptions like samyukta, where "sam" is a prefix to an otherwise independent word (yukta in this case). This means samga will not be saMga. saMsarga, saMyukta, saMtOSha, saMgIta, saMgAna etc. yes, but not saMga, saMkaTa, etc. Note here that tOSha, yukta, gIta and gAna are independent words with a saM prefix, but not ga, kaTa.

Likewise words with "kAra" suffix like ahaMkAra, jhaMkAra will feature the bindu only, not the consonant itself. I am not sure if there are exceptions to this. My thinking tells me other "suffixes" like cAra will also behave similarly. Basically, if they are one unit and form an integral part of the word to make sense, use consonant. If added as suffix or prefix, use anuswAra/bindu.

Unless I am mistaken, things got a bit complicated now.

What this tells me is that for my logic, it would be best if I force anuswara for sanskrit, only in the middle and only if explicitly specified as M and let people specify it judiciously (i.e. it would be too difficult for the logic to know which is one unit vs. suffix etc)..

But for languages like kannada, telugu when preceding k(h)a, g(h)a, c(h)a, j(h)a (and others), the anuswara always figures right? So this would mean that specifying M in the middle for stuff should be used judiciously even when entering for other languages - should be used ONLY if it is an anuswara in sanskrit, otherwise sanskrit rendition would be screwed up. This is certainly a big wrench since a person entering telugu or kannada, and even worse tamil may have no idea about these rules in sanskrit.

This also means that for such words "phonetically better variant in english" would be wrong and cannot be used (i.e. never sangIta, always saMgIta)

This allows me to ask a question which i have had ever since i was exposed to it: What is the purpose behind the anuswara? It seems to represent some other sound for which a character does exist in the script. Why then not use the character itself?

(or may be i should retire to a "less than perfect" sanskrit rendition - i.e. always use anuswara or never use anuswara)

Arun

ramakriya · Post by **ramakriya** » 03 Feb 2007, 02:51

arun - I have not experimented with the export feature yet.

I found one problem with the variables. Or I may not have understood how to use it

1. If I type as caraNam -then the kannaDa transliteration should show it as caraNa. Right? But that is not happening. It does show up as caraNam, with a bindu at the end

2. The key word is not recognized as a variable at all sometimes - even though the spelling is correct.

-Ramakriya

arunk · Post by **arunk** » 03 Feb 2007, 02:54

variables are experimental.

#1. It shows up as caraNam because I seem to have (incorrectly) defined it as such. I need some help in knowing this (for all). I know drs gave the kannada equivalents, i need to go and incorporate them.

#2: Even when you click on a word and hit the "$" button? If so, can you give an example? If it is on "convert all" (i.e. 3 arrows pointing to the $ button), then it is on purpose. I didn't want to mistakenly convert words in the sAhitya portion and thus am extra careful in looking for certain patterns.

Arun

arunk · Post by **arunk** » 03 Feb 2007, 02:59

did i say you are wrong or that i was somehow right so as to try to put doubts in your mind?

Jeez!

Arun

drshrikaanth · Post by **drshrikaanth** » 03 Feb 2007, 03:20

Did I say you did that to me! Jeez!;)

arunk · Post by **arunk** » 03 Feb 2007, 03:28

Unless my assumptions/conclusions are yet again wrong, i am thinking of doing the following
1. Change the word "Sanskrit" to "Devanagari" as it appears on the editor. This is mainly to indicate that the generated script may not be considered proper Sanskrit, as not all written rules are followed
2. Have 2 anuswara options for Devanagari:
(i) Always generate (so more like Hindi)
(ii) Never generate (closer to Sanskrit but not that close=> words like sangIta would be all messed up).

I dont know if this salvages the situation enough. Also I dont know if option 2(ii) is that useful as it would be a mixed bag (neither hindi like nor sanskrit like)

Suggestions?

Arun

jayaram · Post by **jayaram** » 03 Feb 2007, 03:34

Arun - the bindu at the end is how I know, based on my Sanskrit classes in school and college. Usage of M seems to be a variation, sometimes for aesthetics. If you read thru No.2 kriti (vAnchasi yadi) in the pdf file, you will find occurrence of both m and M for the word kuSalam/kuSalaM. To make it simple for yourself, I would suggest you go with the bindu version.

Btw, the way they have written rAgaM and tAlaM is jarring, at least to my eyes!

Also, 'ambika' (as in kamalAmbika) is not written with bindu, the half-consonant is used. At least this is the way I have read and written all these years.

jayaram · Post by **jayaram** » 03 Feb 2007, 03:37

Also you will note 'vAnchasi' is written without bindu, but with the half-consonant.

jayaram · Post by **jayaram** » 03 Feb 2007, 03:46

And DRS is correct in saying that one's mother tongue has an influence on how these are written in Sanskrit. Coming from a Kerala background, I was taught to use the half-consonants instead of the bindu in most cases (within words). Malayalam follows similar rules.

The Namboodiris of Kerala are reputed to have the 'most authentic' knowledge of Sanskrit, so obviously I had assumed we were taught the most accurate version!

(finally, perhaps we should move this language discussion to where it belongs - arun's thread!
Let OP-ji rest in peace!)

arunk · Post by **arunk** » 03 Feb 2007, 05:10

I found this link which talks about anuswaras in context of sandhi rules:
http://www.sanskrit-sanscrito.com.ar/en ... rules.html. It talks about when "m" at end of word becomes anuswara and when it does not. Basically if it is followed by a word that begins with a consonant.

This seems to be followed here: http://sanskrit.safire.com/pdf/DURGA700.pdf, where you have cases where "m" at end is rendered as consonant, and also cases where you have it as anuswara. You see it as an anuswara at the end of a "line"/"sentence" (so no word to follow and hence no consonant to follow) i.e. before a | or ||. For example, the title itself, first line on the right side, and also several other places. You see the bindu used "within a line/sentence". The cases of bindu inside words are much much rarer (but they are there on page 6 - "saMhati..."(?), also on page 12 - saMyugE (?)), and that is of course what drs said.

Of course I don't know how official/authentic these are, but at least I wanted to see some reasoning behind the "mixture of bindu and no bindu cases" - and I see it now.

Now the rule for end of word within a sentence and followed by a consonant is something that is possible to program.
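As a rough illustration of why this part is programmable, here is a minimal Python sketch of the end-of-word rule as stated above: a final "m" becomes "M" only when the next word in the same line begins with a consonant. The vowel set, the function name and the second sample phrase are assumptions made for this sketch; they are not taken from the editor's actual code, and line-final handling (before | or ||) is simply left unchanged here.

```python
# Assumed vowel letters of the transliteration scheme (first letter only).
VOWELS = set("aAiIuUeEoO")


def apply_final_m_rule(line: str) -> str:
    """Turn word-final 'm' into 'M' when a consonant-initial word follows."""
    words = line.split()
    out = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else ""
        # Punctuation such as "|" is not a letter, so it never triggers the rule.
        followed_by_consonant = bool(nxt) and nxt[0].isalpha() and nxt[0] not in VOWELS
        if word.endswith("m") and followed_by_consonant:
            word = word[:-1] + "M"
        out.append(word)
    return " ".join(out)


print(apply_final_m_rule("santatam aham"))  # vowel follows, no word follows
print(apply_final_m_rule("rAmam bhajE"))    # consonant follows
```

The rule only looks at the first letter of the following word, which is why it needs no dictionary, unlike the mid-word cases.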

The trouble is when bindu occurs in the middle depends on interpretation of words etc. and not possible to program without an elaborate setup with look ups to dictionary and such.

So I think we are still down to either

(a) use it like telugu and kannada, and hindi. (i.e. always use it).
(b) or not use it.
(c): use it only at end (i.e. following end of word rule above) but never in the middle.

Of course all of them are not correct for Sanskrit, but I am guessing/hoping that

(a) would be ok for people to read (as they may apply their native language rules).
(c) looks closer to sanskrit and ma....y be passable, although it will definitely mess up the words that drs mentioned.
if (c) is done at all, (b) is useless

Can people pl. chime in and give me advice on whether (a) is ok, and whether i should even bother with (c)?

Thanks
Arun

jayaram · Post by **jayaram** » 03 Feb 2007, 15:25

Arun - I get the feeling if you go with option (a) for Devanagari, we may do the same for Malayalam! And it does look a bit weird if this option is used in Malayalam, at least for old-timers like myself.

My own take on this:
1. ok to use bindu across the board for the endings. as i said earlier, the M ending is for aesthetics, don't believe there's a rigid rule for this.
2. use half-consonant within a word using the appropriate rules - tough to implement, I agree, but at least this can be done for certain often-occurring words, perhaps you could look thru Dikshitar kritis for words such as 'ambika': http://www.rogepost.com/n/4405894335

arunk · Post by **arunk** » 03 Feb 2007, 20:08

yes jayaram it would be less than ideal for malayalam - that is not good either.

I will try the more difficult approach. For sanskrit (and malayalam too?), as drs indicated, the # of cases which DONT employ bindu in the middle of the word outnumber the cases where it does. So I could build up a database of known words that do employ bindu and use smart matching. So by default no bindu except for these known words. This will handle amba etc. correctly by default. It will also handle sangIta, santOsha (assuming they are in database).

On top of that, it may be possible to introduce a feature in the editor (not the scheme), to force use of bindu in sanskrit/malayalam for a specific word. So with a combination of this and the database of known words, we may be able to get things right. Although unless the database of known words is good (so that it takes care of almost all common cases of occurrences in kritis), it would be a pain for the user to have to spoon feed the editor.
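The two-layer idea (a database of known bindu words, plus a per-word override in the editor) could look roughly like this in Python. Everything here is a hypothetical sketch: the table contents, the function name and the override mechanism are invented for illustration and do not describe the actual editor.

```python
# Hypothetical database: entered spelling -> spelling that uses the bindu (M).
KNOWN_BINDU_FORMS = {
    "sangIta": "saMgIta",
    "santOSha": "saMtOSha",
}


def resolve_sanskrit_spelling(word, user_overrides=None):
    """Pick the spelling to render for sanskrit/malayalam.

    Order of precedence: an explicit per-word override from the user,
    then the database of known bindu words, then the word as entered
    (i.e. no bindu by default).
    """
    overrides = user_overrides or {}
    if word in overrides:
        return overrides[word]
    return KNOWN_BINDU_FORMS.get(word, word)


print(resolve_sanskrit_spelling("sangIta"))
print(resolve_sanskrit_spelling("amba"))
print(resolve_sanskrit_spelling("ahankAra", {"ahankAra": "ahaMkAra"}))
```

With this layout the default path never adds a bindu, so words like amba come out right without being listed anywhere.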

I will look into this.

Thanks
Arun

drshrikaanth · Post by **drshrikaanth** » 03 Feb 2007, 20:22

arunk wrote:For sanskrit (and malayalam too?), as drs indicated, the # of cases which DONT employ bindu in the middle of the word outnumber the cases where it does. So I could build up a database of known words that do employ bindu and use smart matching.

Forget about doing this Arun, as the list of words will stretch to several thousands! I just checked. The way out would be to link up with a pre-existing online dictionary and match with that spelling.

arunk · Post by **arunk** » 03 Feb 2007, 21:54

I was afraid of that. It may be possible to interface with a dictionary (or build our own which can be interfaced more easily). Of course more work, but not herculean.

Arun

arunk · Post by **arunk** » 03 Feb 2007, 22:18

i did multiple searches on the cologne-sanskrit dictionary for occurrences of aM, eM, iM, uM, oM (i think their transl. scheme uses M only in the right places - pl. confirm). The search is case-insensitive, so it matches stuff we don't need. So some filtering was needed afterwards.

I saved the (massive) results on my local disk. Did some (programmatic) filtering and assuming I did it right, there are 3076 words in that dictionary which use M (in those contexts). The cumulative # of bytes for all these words is about 34K. Not that bad actually that loading it into memory with editor is not fully ruled out.
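The filtering step is essentially a case-sensitive match applied to case-insensitive search results, followed by a size count. A minimal Python sketch of that idea follows; the sample word list is illustrative only and is not the actual dictionary output, and the pattern covers just the five vowel + M combinations named above.

```python
import re

# Case-sensitive: a lowercase vowel followed by capital M.
M_AFTER_VOWEL = re.compile(r"[aeiou]M")

# Illustrative stand-in for the saved case-insensitive search results.
search_results = ["saMgIta", "samaya", "ahaMkAra", "amba"]

kept = [word for word in search_results if M_AFTER_VOWEL.search(word)]
total_bytes = sum(len(word.encode("utf-8")) for word in kept)

print(kept, total_bytes)
```

Summing the encoded lengths gives the same kind of estimate as the 34K figure: a rough lower bound on what loading the list into memory would cost, before any per-word overhead.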

Of course the scheme that cologne-sanskrit dictionary uses is different and so some more "translation" is needed to our scheme (which can increase the # of chars). This is no big deal.

Drs - pl. let me know if it is ok for me to send you the results, to see if the list of matched words makes sense (i.e. whether i got a good representative list).

Arun

drshrikaanth · Post by **drshrikaanth** » 04 Feb 2007, 00:24

arunk wrote:i did multiple searches on the cologne-sanskrit dictionary for occurrences of aM, eM, iM, uM, oM (i think their transl. scheme uses M only in the right places - pl. confirm). The search is case-insensitive, so it matches stuff we don't need. So some filtering was needed afterwards.

I searched on Cologne too but used a different combination. Your combinations like aM, eM will also come up with what we don't need since, as you have rightly pointed out, the search is case-insensitive. But use these combinations: Mk, Mkh, Mg, Mgh etc. You can't go wrong here.

It is only in the (p, ph, b, bh, m) pentad that you will have problems. Also some overlap in (y, r, l). Otherwise we are fine.

I saved the (massive) results on my local disk. Did some (programmatic) filtering and assuming I did it right, there are 3076 words in that dictionary which use M (in those contexts).

There will easily be more than 10,000 words. More towards 20K I estimate.

Of course the scheme that cologne-sanskrit dictionary uses is different and so some more "translation" is needed to our scheme (which can increase the # of chars). This is no big deal.

The transliteration scheme used there is the H-K convention (Harvard-Kyoto). I had earlier in a post given a step-by-step procedure to convert H-K to our scheme. I think in this thread itself. Check that.

arunk · Post by **arunk** » 04 Feb 2007, 00:39

drshrikaanth wrote:I searched on Cologne too but used a different combination. Your combinations like aM, eM will also come up with what we don't need since, as you have rightly pointed out, the search is case-insensitive. But use these combinations: Mk, Mkh, Mg, Mgh etc. You can't go wrong here.

Filtering out non-M was no big deal. There are several utilities on unix like systems (e.g. my mac) that makes this very easy.

There will easily be more than 10,000 words. More towards 20K I estimate.

I guess then I did something wrong in my steps. The total #of words (i.e. case-insensitive) was 51618. So it did match a lot. Still doesnt add up, either the dictionary does not include most of it, or my search criteria was wrong (it is quite difficult to screw-up the filter step - a very simple command), or i didnt save all the results.

Arun

arunk · Post by **arunk** » 04 Feb 2007, 05:42

after exchanging some emails with drs, we solved the "mystery" as to why my searches weren't getting all the words. Anyway the entire list is about 7400 words, which i think is still manageable (but need to confirm).

Arun