Resource Centre for Indian Language Technology Solutions Bangla

 

Technical Characteristics

Like other Indian scripts that evolved from the Brahmi script, Bangla is written from left to right and consists of a sequence of simple and complex characters.

         

 

Basic characters

There are 11 vowel and 39 consonant characters in the modern Bangla alphabet. They are called basic characters. Because each basic character has a standalone code in the common core, this set is also referred to as the primary or independent character set. Each is assigned a code point in the ISCII standard: 131-144 for successive vowels and 146-232 for successive consonants.

In Unicode, independent vowels are placed from U+0985 to U+0994; some positions in this range are reserved (U+098D, U+098E, U+0991, U+0992). Consonants range from U+0995 to U+09B9 and again from U+09DC to U+09DF.
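The ranges above can be checked against the Unicode character database that ships with Python. The following is a minimal sketch; the helper names `assigned` and `unassigned` are made up for this illustration, and the result reflects whatever Unicode version the local Python build carries.

```python
import unicodedata

def assigned(start, end):
    """Code points in [start, end] that have a Unicode character name."""
    return [cp for cp in range(start, end + 1)
            if unicodedata.name(chr(cp), None) is not None]

def unassigned(start, end):
    """Code points in [start, end] with no character name (reserved)."""
    return [cp for cp in range(start, end + 1)
            if unicodedata.name(chr(cp), None) is None]

# Reserved positions inside the independent-vowel range.
vowel_gaps = unassigned(0x0985, 0x0994)

# Consonant letters in the two ranges named in the text.
consonants = assigned(0x0995, 0x09B9) + assigned(0x09DC, 0x09DF)

print([f"U+{cp:04X}" for cp in vowel_gaps])
for cp in consonants:
    print(f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp)))
```

The consonant ranges also contain reserved positions, which is why the sketch filters on the character name instead of assuming every code point in a range is in use.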

The basic characters and their standard representations in Roman characters are shown in the following two tables. The concept of upper/lower case is absent in Bangla.

 

        অ (A) আ (AA) ই (I) ঈ (II) উ (U) ঊ (UU) ঋ (R) এ (E) ঐ (AI) ও (O) ঔ (AU)

        Table I: Vowel set

        ক (KA) খ (KHA) গ (GA) ঘ (GHA) ঙ (NGA) চ (CA) ছ (CHA) জ (JA) ঝ (JHA) ঞ (NYA) ট (TTA) ঠ (TTHA) ড (DDA) ঢ (DDHA) ণ (NNA) ত (TA) থ (THA) দ (DA) ধ (DHA) ন (NA) প (PA) ফ (PHA) ব (BA) ভ (BHA) ম (MA) য (YA) র (RA) ল (LA) শ (SHA) ষ (SSA) স (SA) হ (HA) ড় (RRA) ঢ় (RHA) য় (YYA)

    Table II: Consonant set


 

Modifiers and compound characters

A vowel other than অ (A) that follows a consonant takes a modified shape, placed to the left, right (or both) or bottom of the consonant, depending on the vowel. These forms are called the dependent vowel set, vowel modifiers, or vowel allographs.

 

        ◌া (AA) ◌ি (I) ◌ী (II) ◌ু (U) ◌ূ (UU) ◌ৃ (R) ◌ে (E) ◌ৈ (AI) ◌ো (O) ◌ৌ (AU)

        Table III: Vowel Modifier

◌ indicates the consonant character position. For example, for the first consonant character ক (KA), the vowel allographs attach to the consonant as follows: কা, কি, কী, কু, কূ, কৃ, কে, কৈ, কো and কৌ. The vowels উ (U), ঊ (UU) and ঋ (R) may take different modified shapes when attached to some consonant characters. They also change the shape of some consonant characters to which they are attached. In Unicode, the modifiers (dependent vowel signs) range from U+09BE to U+09CC.
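The attachment of vowel modifiers to KA can be reproduced from code points. This is a small sketch using Python's standard `unicodedata` module; the helper name `vowel_signs` is an assumption made for this example. In Unicode text the vowel sign is always stored after the consonant, even for signs that are drawn to its left.

```python
import unicodedata

KA = "\u0995"  # BENGALI LETTER KA

def vowel_signs():
    """Assigned dependent vowel signs in U+09BE..U+09CC."""
    return [chr(cp) for cp in range(0x09BE, 0x09CD)
            if unicodedata.name(chr(cp), "").startswith("BENGALI VOWEL SIGN")]

# Stored order is consonant first, then the sign; the renderer decides
# whether the sign is drawn left, right, below or on both sides.
syllables = [KA + sign for sign in vowel_signs()]

for s in syllables:
    print(s, [f"U+{ord(c):04X}" for c in s])
```

The range also holds a sign for vocalic RR (U+09C4), so the sketch prints one more combination than Table III lists.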

As in other Indic scripts, some of these vowel modifiers are two-part vowel signs, in which one half of the vowel is placed on each side of a consonant letter or cluster, for example ো (O, U+09CB) and ৌ (AU, U+09CC). In Unicode, each of these vowel signs is coded at the chart position isomorphic with the corresponding vowel in Devanagari. Hence the Bangla vowel sign ৌ (AU, U+09CC) is isomorphic with the Devanagari vowel sign ौ (AU, U+094C). To provide compatibility with existing implementations of scripts that use two-part vowel signs, the Unicode standard explicitly encodes the right half of these vowel signs; for example, the Bangla AU length mark (U+09D7) represents the right-half glyph component of the Bangla vowel sign ৌ (AU, U+09CC).
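Both properties described here, the two-part structure and the shared chart position with Devanagari, can be observed with canonical decomposition (NFD). A short sketch; the helper `parts` and the block-base constants are names chosen for this example.

```python
import unicodedata

BENGALI_BASE = 0x0980     # first code point of the Bengali block
DEVANAGARI_BASE = 0x0900  # first code point of the Devanagari block

def parts(ch):
    """Canonical decomposition of a character, as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

o_parts = parts("\u09CB")   # vowel sign O
au_parts = parts("\u09CC")  # vowel sign AU

# Same offset from the start of each block: the "isomorphic" position.
same_offset = (0x09CC - BENGALI_BASE) == (0x094C - DEVANAGARI_BASE)

print(o_parts, au_parts, same_offset)
```

The decomposition of AU ends in U+09D7, the separately encoded right-half length mark mentioned above.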

A consonant following (or preceding) another consonant is represented by a modifier, called a consonant modifier, if the shape of the character to which it is attached remains unaltered. Among consonant modifiers, ্য (YA-phalaa) is the most frequent. It is a presentation form of the character য (YA, U+09AF), which takes this shape depending on the context. It can be applied to the nominal form of any consonant or conjunct, or even to a vowel. Applying it may either double the sound or add 'ya'-ness to the character it is applied to. Though it is a presentation form of an existing character, it should be treated as a separate character for two reasons. The character may appear along with a vowel, where the conjunct formation rules cannot be applied. For example:

অ + ্য + া --> অ্যা

Apart from that, the combination of র (RA, U+09B0) and য (YA, U+09AF) has two different conjunct forms:

র + য --> র্য (RA as reph above YA)

র + য --> র‍্য (RA followed by YA-phalaa)

Both of them cannot be formed using the same conjunct rule. If YA-phalaa is treated as a combining character rather than as a form of YA, both of these issues could be solved.
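For comparison, the Unicode Standard as published does not encode YA-phalaa separately; it tells the two RA + YA forms apart with a zero-width joiner. The sketch below only builds the two code point sequences. The variable names are assumptions for this example, and how each sequence is drawn depends on the font and shaping engine.

```python
RA = "\u09B0"      # BENGALI LETTER RA
YA = "\u09AF"      # BENGALI LETTER YA
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA (hasant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# RA + virama + YA: RA is normally rendered as reph above YA.
reph_form = RA + VIRAMA + YA

# RA + ZWJ + virama + YA: requests RA followed by YA-phalaa.
ya_phalaa_form = RA + ZWJ + VIRAMA + YA

for label, seq in (("reph", reph_form), ("ya-phalaa", ya_phalaa_form)):
    print(label, seq, [f"U+{ord(c):04X}" for c in seq])
```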

If the shape or size of all the characters involved changes from that of their respective basic characters, the cluster of attached consonants is called a compound character or yuktakshar. In Bangla, consonant conjuncts can be formed as vertical conjuncts, in which the components are stacked vertically, unlike normal conjuncts, in which the components are placed side by side. To construct these conjuncts, ligatures and consonants may take upper-half and lower-half forms. Compounds of two consonants are the most common, although three consonants can also be compounded. There are about 250 compound characters, of which a subset of frequently used ones is shown in the table below. The total number of basic, modified and compound characters is about 300.

        Two characters are conjunct: ক্ক (KA+KA) ঙ্ক (NGA+KA) ল্ক (LA+KA) ষ্ক (SSA+KA) স্ক (SA+KA) ঙ্খ (NGA+KHA) স্খ (SA+KHA) ঙ্গ (NGA+GA) ল্গ (LA+GA) গ্গ (GA+GA) চ্চ (CA+CA) ঞ্চ (NYA+CA) শ্চ (SHA+CA) চ্ছ (CA+CHA) ঞ্ছ (NYA+CHA) শ্ছ (SHA+CHA) জ্জ (JA+JA) ঞ্জ (NYA+JA) ব্জ (BA+JA) জ্ঝ (JA+JHA) ঞ্ঝ (NYA+JHA) জ্ঞ (JA+NYA) চ্ঞ (CA+NYA) ক্ট (KA+TTA) ট্ট (TTA+TTA) ণ্ট (NNA+TTA) ন্ট (NA+TTA) প্ট (PA+TTA) ল্ট (LA+TTA) ষ্ট (SSA+TTA) স্ট (SA+TTA) ণ্ঠ (NNA+TTHA) ন্ঠ (NA+TTHA) ষ্ঠ (SSA+TTHA) ড্ড (DDA+DDA) ণ্ড (NNA+DDA) ন্ড (NA+DDA) ল্ড (LA+DDA) ণ্ঢ (NNA+DDHA) ন্ঢ (NA+DDHA) ণ্ণ (NNA+NNA) ষ্ণ (SSA+NNA) হ্ণ (HA+NNA) ক্ত (KA+TA) ত্ত (TA+TA) ন্ত (NA+TA) প্ত (PA+TA) স্ত (SA+TA) ত্থ (TA+THA) ন্থ (NA+THA) স্থ (SA+THA) দ্দ (DA+DA) ন্দ (NA+DA) ব্দ (BA+DA) গ্ধ (GA+DHA) দ্ধ (DA+DHA) ন্ধ (NA+DHA) ব্ধ (BA+DHA) গ্ন (GA+NA) ঘ্ন (GHA+NA) ত্ন (TA+NA) ধ্ন (DHA+NA) ন্ন (NA+NA) প্ন (PA+NA) ম্ন (MA+NA) শ্ন (SHA+NA) স্ন (SA+NA) হ্ন (HA+NA) প্প (PA+PA) ম্প (MA+PA) ল্প (LA+PA) ষ্প (SSA+PA) স্প (SA+PA) ম্ফ (MA+PHA) ল্ফ (LA+PHA) ষ্ফ (SSA+PHA) স্ফ (SA+PHA) ক্ব (KA+BA) গ্ব (GA+BA) জ্ব (JA+BA) ট্ব (TTA+BA) ণ্ব (NNA+BA) ত্ব (TA+BA) থ্ব (THA+BA) দ্ব (DA+BA) ধ্ব (DHA+BA) ন্ব (NA+BA) ব্ব (BA+BA) ম্ব (MA+BA) ল্ব (LA+BA) শ্ব (SHA+BA) স্ব (SA+BA) হ্ব (HA+BA) চ্ব (CA+BA) ড্ব (DDA+BA) ম্ভ (MA+BHA) দ্ভ (DA+BHA) ক্ম (KA+MA) গ্ম (GA+MA) ঙ্ম (NGA+MA) ণ্ম (NNA+MA) ত্ম (TA+MA) দ্ম (DA+MA) ন্ম (NA+MA) ম্ম (MA+MA) ল্ম (LA+MA) শ্ম (SHA+MA) ষ্ম (SSA+MA) স্ম (SA+MA) হ্ম (HA+MA) ক্র (KA+RA) খ্র (KHA+RA) গ্র (GA+RA) ঘ্র (GHA+RA) প্র (PA+RA) দ্র (DA+RA) ধ্র (DHA+RA) জ্র (JA+RA) ত্র (TA+RA) থ্র (THA+RA) ব্র (BA+RA) ভ্র (BHA+RA) ফ্র (PHA+RA) ট্র (TTA+RA) ড্র (DDA+RA) হ্র (HA+RA) স্র (SA+RA) শ্র (SHA+RA) ম্র (MA+RA) ক্ল (KA+LA) গ্ল (GA+LA) প্ল (PA+LA) ফ্ল (PHA+LA) ব্ল (BA+LA) ম্ল (MA+LA) ল্ল (LA+LA) শ্ল (SHA+LA) স্ল (SA+LA) হ্ল (HA+LA) ক্ষ (KA+SSA) ক্স (KA+SA) ন্স (NA+SA) প্স (PA+SA)

        Three characters are conjunct: ক্ষ্ণ (KA+SSA+NNA) ক্ষ্ম (KA+SSA+MA) চ্ছ্ব (CA+CHA+BA) চ্ছ্র (CA+CHA+RA) জ্জ্ব (JA+JA+BA) ণ্ড্র (NNA+DDA+RA) ত্ত্ব (TA+TA+BA) দ্ভ্র (DA+BHA+RA) ন্ত্ব (NA+TA+BA) ন্ত্র (NA+TA+RA) ন্দ্র (NA+DA+RA) ন্দ্ব (NA+DA+BA) ম্প্র (MA+PA+RA) ম্ভ্র (MA+BHA+RA) স্প্র (SA+PA+RA) স্ত্র (SA+TA+RA) ন্ধ্র (NA+DHA+RA) স্ট্র (SA+TTA+RA) স্ত্ব (SA+TA+BA) ল্ড্র (LA+DDA+RA) ষ্ক্র (SSA+KA+RA) দ্ধ্ব (DA+DHA+BA) ন্ড্র (NA+DDA+RA) স্ক্র (SA+KA+RA) ক্ট্র (KA+TTA+RA) ষ্প্র (SSA+PA+RA) দ্দ্ব (DA+DA+BA) ণ্ট্র (NNA+TTA+RA)

    Table V: Conjunct Character set in Bangla
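In Unicode text, each conjunct in the table is stored as its component consonants joined by the virama (hasant, U+09CD), and the font supplies the ligature. The following sketch builds a conjunct from the Roman names used in the table; the function name `conjunct` is an assumption for this example, and it relies on those names matching the Unicode character names "BENGALI LETTER KA" and so on.

```python
import unicodedata

VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA (hasant)

def conjunct(*names):
    """Join the named consonants with virama, e.g. conjunct("KA", "SSA")."""
    letters = [unicodedata.lookup(f"BENGALI LETTER {n}") for n in names]
    return VIRAMA.join(letters)

ka_ssa = conjunct("KA", "SSA")           # two-consonant conjunct
na_ta_ra = conjunct("NA", "TA", "RA")    # three-consonant conjunct

for c in (ka_ssa, na_ta_ra):
    print(c, [f"U+{ord(ch):04X}" for ch in c])
```

Whether a given sequence is displayed as a fused ligature or as separate letters with a visible hasant depends on the font in use.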

In the following, technical issues related to a few characters (including some special characters) are elaborated separately:


Khanda-ta (ৎ)

It is a combining character, which is widely used in Bangla script. For some transliterated words it might appear at the beginning. Examples: cdEX, ]cd, dL_Q and Xda.

 

Visarga (ঃ)

This character is used for writing Sanskrit words in Bangla script.

 

Nukta

Bangla script does not use the Nukta sign (U+09BC) explicitly. Internally, ড় (RRA, U+09DC), ঢ় (RHA, U+09DD) and য় (YYA, U+09DF) are formed with the help of the Nukta sign.
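The internal use of the Nukta sign shows up in the canonical decompositions of these three letters. A brief sketch with Python's `unicodedata` module; the dictionary name `decomposed` is chosen for this example.

```python
import unicodedata

NUKTA = "\u09BC"  # BENGALI SIGN NUKTA

# Map each precomposed letter to its canonical (NFD) decomposition.
decomposed = {ch: unicodedata.normalize("NFD", ch)
              for ch in ("\u09DC", "\u09DD", "\u09DF")}

for ch, seq in decomposed.items():
    base = seq[0]
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} = "
          f"{unicodedata.name(base)} + NUKTA: {seq.endswith(NUKTA)}")
```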


 

Avagraha

The avagraha is used to indicate the presence of the vowels অ (A) and আ (AA), which are sometimes elided in euphonic (sandhi) combination, a pervasive feature of Sanskrit and its cognate languages like Bangla. Such elision occurs when the vowel at the beginning of a word follows a word ending in AA or visarga (ঃ).

In the combined form, the avagraha mark indicates that the expression was formed by sandhi and that the second of the combined words begins with 'A'. Without the avagraha, splitting the euphonic combination becomes difficult, and the meaning cannot be easily understood. Even today, thousands of copies of holy works such as the Raamaayana, Mahaabhaarata, Bhaagavad-Gita and Durgaasaptasati, and of other texts dealing with rituals such as marriage, funerals and the worship of Shiva, Vishnu, Durgaa, Kaali etc., which are written in Sanskrit, are printed in Bangla script. Without the avagraha, correct and dependable rendering of Sanskrit works in the Bangla script would be impossible.


 

Numerals

In Bangla, the compound numbers from 11 to 19, and the components involving them in higher numbers, are pronounced in a way that does not make clear which decade they belong to. For example, 11 is read as 'egaro', and from this name alone it is not possible to tell whether the number comes after 10 or after 20. Another interesting fact is that the numbers from 21 to 99 are written from left to right, but their names are read from right to left. Above 100 or 1000, however, the digit in the hundreds or thousands position is read first, followed by the rest. Hundred is called 'shata', thousand 'sahasra', ten thousand 'azut', hundred thousand 'lakh' or 'lakhkha', one million (ten lakhs) 'nizut', and ten million 'koti'. The numerals in Bangla are as follows:

 

        ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯

        0 1 2 3 4 5 6 7 8 9

        Table VI: Bangla numeral set

In Unicode, the numerals occupy U+09E6 to U+09EF.
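Because the ten digits are contiguous in this range, converting between Bangla and ASCII digits is a simple table lookup. A minimal sketch; `BANGLA_DIGITS`, `TO_ASCII` and `to_ascii_digits` are names invented for this example.

```python
import unicodedata

# The ten Bangla digits, U+09E6 (zero) through U+09EF (nine).
BANGLA_DIGITS = "".join(chr(cp) for cp in range(0x09E6, 0x09F0))
TO_ASCII = str.maketrans(BANGLA_DIGITS, "0123456789")

def to_ascii_digits(text):
    """Replace Bangla digits in text with ASCII digits."""
    return text.translate(TO_ASCII)

sample = "\u09E7\u09E8\u09E9"  # Bangla digits one, two, three
print(to_ascii_digits(sample))
print(int(sample))  # int() accepts any Unicode decimal digits
print(unicodedata.digit("\u09EF"))
```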


Punctuation Marks

Modern Bangla uses punctuation marks borrowed from English, except for the end-of-sentence marker. Old Bangla books use single and double vertical bars to indicate a full stop, but modern Bangla uses only the single vertical bar (।).

 

Ancient/Obsolete glyphs

U+09F2 to U+09F9 are a series of Bangla additions for writing currency and fractions. Among these are ancient Bangla glyphs that were in frequent use until the mid-1930s. After the recommendation of the spelling reform committee of Calcutta University in 1936, all of these glyphs became infrequent. They still occur in some documents, and anyone who wants a digital copy of an old document needs support for these glyphs in font files as well as in Unicode. Hence, these glyphs have not been discarded from Unicode.
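The characters in this range can be listed by name from the Unicode character database. A short sketch; the dictionary name `currency_names` is chosen for this example.

```python
import unicodedata

# Names of the currency and fraction characters U+09F2..U+09F9.
currency_names = {f"U+{cp:04X}": unicodedata.name(chr(cp))
                  for cp in range(0x09F2, 0x09FA)}

for code, name in currency_names.items():
    print(code, name)
```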


 

Character Statistics

Corpus-based statistical analysis of a language is very useful in various applications, including OCR, cryptography, linguistics, speech analysis and recognition, spelling error correction, electronic dictionaries and machine aids for the visually handicapped. The following are the frequencies of characters in the Bangla corpus provided by the MIT database, which consists of more than 3 million words (total characters: 1,43,18,761) of running text covering a wide range of genres, viz. modern fiction, short stories, newspapers and journals.

The global grapheme statistics of characters in the corpus are presented in the following Table.

Character    Percentage
Vowel        36.39
Consonant    63.61

Table VII: Character statistics summary table

The table shows that the consonant percentage is much higher than the vowel percentage, and that the compound character percentage (7.34%) is very small compared with the other consonants and vowels. In commonly used language, words tend to contain the most consonants, then vowels, and the fewest compound characters. The global character occurrence statistics should be useful for optical character recognition (OCR) development, spell-checker design and other problems.
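A vowel/consonant split of this kind can be computed for any Unicode Bangla text. The sketch below is only an illustration under stated assumptions: it counts independent vowels and dependent vowel signs as vowels, counts the consonant letters of the two Unicode ranges as consonants, and ignores everything else. The counting rules actually used for the corpus are not given in the text, and the function names are made up for this example.

```python
def is_vowel(ch):
    """Independent vowels (U+0985..U+0994) or vowel signs (U+09BE..U+09CC)."""
    cp = ord(ch)
    return 0x0985 <= cp <= 0x0994 or 0x09BE <= cp <= 0x09CC

def is_consonant(ch):
    """Consonant letters in U+0995..U+09B9 or U+09DC..U+09DF."""
    cp = ord(ch)
    return 0x0995 <= cp <= 0x09B9 or 0x09DC <= cp <= 0x09DF

def shares(text):
    """Return (vowel %, consonant %) over the letters counted above."""
    vowels = sum(is_vowel(c) for c in text)
    consonants = sum(is_consonant(c) for c in text)
    total = vowels + consonants
    if total == 0:
        return (0.0, 0.0)
    return (100 * vowels / total, 100 * consonants / total)

# KA LA KA AA-sign TA AA-sign: four consonants and two vowel signs.
print(shares("\u0995\u09B2\u0995\u09BE\u09A4\u09BE"))
```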

The next table represents the consonants and vowels (vowels and their modified forms) according to their percentage of occurrence in the said corpus.

 

Char.  % of occur.   Char.  % of occur.   Char.  % of occur.
       10.58635      +      1.28592              0.30554
Å#     9.12345       `      1.27980              0.27616
Ì[ý    9.07098       A      1.16347              0.26520
#      5.79748              1.15866              0.25636
X      5.28963       K      1.07006              0.25339
       5.14978              1.00475       H      0.19234
       4.63765              0.97830              0.14776
[      3.99412       B      0.90798       é#     0.09725
A      3.09293       U      0.85059              0.08548
_      3.01622       \      0.84297       Å#ì    0.08430
]      2.96961              0.82305       #f     0.05097
Y      2.70545       F      0.82027       B      0.03451
V      2.29465       %      0.76769              0.02161
Ì^     2.25143       C      0.72528       @      0.00990
^      2.11691       S      0.70735       <      0.00836
       2.05253       #e     0.46951       RÍô    0.00666
C      1.56969       QÍö    0.45512       D      0.00569
Å#ç    1.40914       =      0.44692       >      0.00376
       1.37966              0.33243       %îç    0.00005
G      1.34470       #g     0.32727
L      1.31026              0.31469
Table VIII: Character-wise percentage of occurrence

 

The next table shows the most frequently used compound characters found in the corpus. The cluster প্র (PA+RA) is the most frequent, followed by ক্ষ (KA+SSA), ন্ত (NA+TA) and ত্র (TA+RA).

Comp. Char.  % of Occur.   Comp. Char.  % of Occur.
Ã            9.950                      1.768
             4.590         îÂÉ          1.638
L            4.213         M            1.618
S            3.328         îÇÂ          1.536
õÉ           3.035         S            1.534
/            2.804         M            1.526
j            2.402                      1.521
             2.254         õþÉ          1.449
             2.251         ó            1.413
ñÉ           2.184         ^            1.376
áè           1.921                      1.337
©            1.892         òÉ           1.324

Table IX: Percentage of occurrence of some compound characters
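Compound character frequencies like those in Table IX can be estimated from Unicode text by matching runs of consonants joined by the virama. This is a minimal sketch under that assumption; `CLUSTER` and `cluster_percentages` are names invented for this example, and the pattern ignores ZWJ/ZWNJ and any other special cases the original analysis may have handled.

```python
import re
from collections import Counter

CONS = "[\u0995-\u09B9\u09DC-\u09DF]"  # consonant letters
# A consonant followed by one or more (virama + consonant) pairs.
CLUSTER = re.compile(f"{CONS}(?:\u09CD{CONS})+")

def cluster_percentages(text):
    """Percentage share of each conjunct among all conjuncts in text."""
    counts = Counter(CLUSTER.findall(text))
    total = sum(counts.values())
    return {c: 100 * n / total for c, n in counts.items()}

# Three words containing PA+RA twice and KA+SSA once.
sample = ("\u09AA\u09CD\u09B0\u09A5\u09AE "
          "\u0995\u09CD\u09B7\u09A3 "
          "\u09AA\u09CD\u09B0\u0995\u09BE\u09B6")
for cluster, pct in cluster_percentages(sample).items():
    print(cluster, round(pct, 2))
```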

The positional character occurrence statistics for Bangla are as follows. The percentages of occurrence of each consonant and vowel at each position in the words of the corpus are tabulated in Table X. The character ক (KA) has the highest percentage (9.98%) at the first position of words. The occurrence of ব (BA), স (SA) and প (PA) at the first position is also quite high compared with other characters. The character র (RA) has the highest percentage at all positions but the first.

Char.  Pos 1 %  Rank  Pos 2 %  Rank  Pos 3 %  Rank  Pos 4 %  Rank  Pos 5 %  Rank  Pos 6 %  Rank
Õ         3.57    12     0.01    37     0.82    30     0.02    39     0.02    40     0.03    38
Õ         4.69     7     0.01    36     1.05    24     0.03    38     0.03    38     0.04    37
ý         0.71    28     4.16     9     2.21    16     1.81    18     2.16    14     1.82    17
Ö         0.04    37     0.00    40     0.01    43     0.00    44     0.00    43     0.00    43
ëÂ        1.72    20     0.20    32     0.66    31     0.08    36     0.15    35     0.09    34
Ø         0.02    40     0.00    46     0.00    46     0.00    52     0.00    53     0.00    46
ÙÂ        0.04    36     0.00    43     0.01    44     0.00    42     0.00    42     0.00    42
Û         5.39     5     0.02    35     1.14    23     0.10    34     0.05    37     0.09    35
Ü         0.17    33     0.00    44     0.02    41     0.00    43     0.00    44     0.00    44
Ý         1.58    21     0.31    30     1.04    25     0.98    25     0.89    24     1.01    25
Þ         0.03    39     0.00    45     0.01    45     0.00    45     0.00    45     0.00    45
ßÂ        9.98     1     6.69     6     7.32     3     6.96     5     5.15     5     6.69     5
à         0.97    25     1.76    15     2.50    15     0.62    27     0.89    26     0.61    27
á         2.31    15     1.21    20     1.92    19     2.60    13     1.65    18     1.93    16
â         0.66    29     0.07    34     0.24    37     0.09    35     0.23    33     0.07    36
ã         0.00    41     1.63    18     0.29    36     0.35    29     0.35    31     0.22    32
äÂ        1.94    18     1.08    22     1.22    21     2.44    14     1.07    23     1.22    22
å         1.20    23     1.18    21     1.64    20     3.50    10     2.08    15     1.53    19
æ         2.43    14     0.89    26     2.62    12     1.77    19     1.64    19     1.77    18
çÂ        0.13    34     0.00    38     0.37    33     0.03    37     0.10    36     0.02    39
          0.00    45     0.48    28     0.15    39     0.21    33     0.45    30     0.24    31
éÂ        0.54    30     1.27    19     2.74    11     2.22    15     3.45     9     2.48    13
êÂ        0.21    32     1.01    23     0.34    34     0.29    31     0.25    32     0.33    30
ëÂ        0.40    31     0.08    33     0.22    38     0.46    28     0.49    29     0.40    28
ìÂ        0.10    35     0.00    39     0.02    42     0.00    41     0.00    41     0.00    41
í         0.00    42     0.83    27     0.42    32     1.67    20     1.71    17     2.75    10
îÂ        4.56     9     5.32     8     5.19     6     9.16     3     8.48     4     9.56     2
ï         1.23    22     1.74    16     0.96    28     1.54    21     1.13    22     1.29    20
ð         4.49    11     2.53    12     2.51    14     3.66     9     3.02    10     3.88     8
ñ         0.74    27     1.65    17     0.95    29     1.87    16     1.57    20     1.08    23
ò         4.79     6     7.54     3     8.75     2     7.95     4     9.95     2     9.02     3
Âó        7.72     4     2.60    11     2.88    10     3.03    11     2.63    13     2.51    12
ôÂ        0.83    26     0.30    31     0.32    35     0.25    32     0.21    34     0.19    33
õ         8.68     2     6.97     5     4.78     7     6.08     6     4.02     7     4.33     7
öÂ        1.77    19     0.99    24     1.03    26     1.52    22     0.89    25     1.01    24
          4.55    10     7.18     4     3.88     9     4.48     8     4.01     8     3.53     9
û         3.38    13     5.87     7     6.59     4     9.42     2     8.62     3     7.76     4
ûþ        2.07    17    18.79     1    13.87     1    11.49     1    18.41     1    17.57     1
õþ        1.18    24     7.72     2     6.51     5     4.77     7     4.71     6     5.12     6
ù         0.00    44     0.00    41     0.98    27     0.31    30     0.75    27     0.80    26
ú         2.12    16     0.90    25     2.57    13     1.32    23     1.72    16     2.39    14
ø         0.03    38     0.32    29     2.03    18     1.83    17     2.72    11     2.21    15
ü         8.42     3     2.94    10     4.06     8     2.90    12     2.64    12     2.72    11
ý         4.64     8     1.92    13     2.05    17     1.25    24     1.14    21     1.27    21
ëÂÿ       0.00    43     1.81    14     1.14    22     0.92    26     0.57    28     0.40    29
ÿ         0.00    46     0.00    42     0.03    40     0.00    40     0.02    39     0.01    40

Table X: Positional Character Occurrence Statistics in Bangla
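Positional statistics of the kind shown in Table X can be derived from a word list by counting which character occurs at each position. The following is a sketch only: it treats every Unicode code point as one position, which may differ from how positions were counted for the corpus, and the name `positional_percentages` is invented for this example.

```python
from collections import Counter, defaultdict

def positional_percentages(words, max_pos=6):
    """For positions 1..max_pos, the percentage share of each character."""
    counts = defaultdict(Counter)
    for word in words:
        for pos, ch in enumerate(word[:max_pos], start=1):
            counts[pos][ch] += 1
    return {pos: {ch: 100 * n / sum(c.values()) for ch, n in c.items()}
            for pos, c in counts.items()}

# Three short words: two begin with KA, one with BA.
words = ["\u0995\u09B2\u09AE", "\u0995\u09A5\u09BE", "\u09AC\u0987"]
stats = positional_percentages(words)
for pos in sorted(stats):
    ranked = sorted(stats[pos].items(), key=lambda kv: -kv[1])
    print(pos, [(ch, round(p, 2)) for ch, p in ranked])
```

Sorting each position's percentages in descending order, as the loop does, gives the rank columns of the table.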

 


For more information mail us at rc_bangla@isical.ac.in