Indian Subcontinent

The Indian subcontinent is the region of South Asia that projects into the Indian Ocean. The term is generally taken to cover Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan and Sri Lanka.

Within India itself, a multitude of languages are spoken and used in day-to-day life, which shows the basic need for NLP applications that work in vernacular languages.

There are a handful of Python libraries we can use to perform text processing and build NLP applications for Indian languages.

Natural Language Toolkit for Indic Languages

These are some of the languages of the Indian subcontinent that each library supports:

iNLTK- Hindi, Punjabi, Sanskrit, Gujarati, Kannada, Malayalam, Nepali, Odia, Marathi, Bengali, Tamil, Urdu
Indic NLP Library- Assamese, Sindhi, Sinhala, Sanskrit, Konkani, Kannada, Telugu
StanfordNLP- Many of the above languages

iNLTK (Natural Language Toolkit for Indic Languages)

iNLTK provides most of the features that modern NLP tasks require, such as generating vector embeddings for input text, tokenization and sentence similarity, through an intuitive, easy-to-use API.
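
To make "sentence similarity" a little more concrete: one common way to score it from embeddings is the cosine similarity between two sentence vectors. The sketch below uses only the standard library and made-up three-dimensional vectors. Whether iNLTK computes similarity in exactly this way is an assumption here, and the helper name `cosine_similarity` is only illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "sentence vectors", for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # points in the same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: nothing in common
```

Real sentence embeddings have hundreds of dimensions, but the score is read the same way: the closer to 1, the more similar the two sentences.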

In [1]:
!pip3 install inltk
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: inltk in ./.local/lib/python3.6/site-packages (0.9)

Setting the language

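iNLTK has language models trained for different languages, and to use one we first have to download its files. We will be working with Hindi text, so let’s set “Hindi” as our language. In a Jupyter notebook the call can raise the RuntimeError shown in the traceback, because the notebook is already running an asyncio event loop. The “Done!” printed after the traceback suggests the download still finished: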

In [3]:
from inltk.inltk import setup
setup('hi')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-ea8b68f1926d> in <module>
      1 from inltk.inltk import setup
----> 2 setup('hi')

~/.local/lib/python3.6/site-packages/inltk/inltk.py in setup(language_code)
     31     loop = asyncio.get_event_loop()
     32     tasks = [asyncio.ensure_future(download(language_code))]
---> 33     learn = loop.run_until_complete(asyncio.gather(*tasks))[0]
     34     loop.close()
     35 

/usr/lib/python3.6/asyncio/base_events.py in run_until_complete(self, future)
    473         future.add_done_callback(_run_until_complete_cb)
    474         try:
--> 475             self.run_forever()
    476         except:
    477             if new_task and future.done() and not future.cancelled():

/usr/lib/python3.6/asyncio/base_events.py in run_forever(self)
    427         self._check_closed()
    428         if self.is_running():
--> 429             raise RuntimeError('This event loop is already running')
    430         if events._get_running_loop() is not None:
    431             raise RuntimeError(

RuntimeError: This event loop is already running
Done!
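
The traceback above comes from asyncio rather than from the language model itself. Here is a minimal sketch, using only the standard library, that reproduces the same kind of error: it calls `run_until_complete` on a loop that is already running, which is the situation inside a notebook kernel. The names `fake_download` and `notebook_cell` are hypothetical stand-ins and are not part of iNLTK.

```python
import asyncio


async def fake_download():
    # Hypothetical stand-in for a model download coroutine
    return "model files"


async def notebook_cell():
    # Inside a coroutine a loop is already running, much like in a Jupyter kernel
    loop = asyncio.get_running_loop()
    pending = fake_download()
    try:
        # Asking a running loop to run again is not allowed
        loop.run_until_complete(pending)
    except RuntimeError as exc:
        pending.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


message = asyncio.run(notebook_cell())
print(message)
```

Running the setup call from a plain Python script, where no event loop is active yet, avoids this situation.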

Tokenization

The first step in most NLP tasks is to break the text down into its smallest units, or tokens. iNLTK supports tokenization for all 12 languages listed earlier:

In [4]:
from inltk.inltk import tokenize

hindi_text = """ कोरोना वायरस (सीओवी) का संबंध वायरस के ऐसे परिवार से है जिसके संक्रमण से जुकाम से लेकर सांस लेने में तकलीफ जैसी समस्या हो सकती है। इस वायरस को पहले कभी नहीं देखा गया है। इस वायरस का संक्रमण दिसंबर में चीन के वुहान में शुरू हुआ था। डब्लूएचओ के मुताबिक बुखार, खांसी, सांस लेने में तकलीफ इसके लक्षण हैं। अब तक इस वायरस को फैलने से रोकने वाला कोई टीका नहीं बना है।

इसके संक्रमण के फलस्वरूप बुखार, जुकाम, सांस लेने में तकलीफ, नाक बहना और गले में खराश जैसी समस्याएं उत्पन्न होती हैं। यह वायरस एक व्यक्ति से दूसरे व्यक्ति में फैलता है। इसलिए इसे लेकर बहुत सावधानी बरती जा रही है। यह वायरस दिसंबर में सबसे पहले चीन में पकड़ में आया था। इसके दूसरे देशों में पहुंच जाने की आशंका जताई जा रही है।

कोरोना से मिलते-जुलते वायरस खांसी और छींक से गिरने वाली बूंदों के ज़रिए फैलते हैं। कोरोना वायरस अब चीन में उतनी तीव्र गति से नहीं फ़ैल रहा है जितना दुनिया के अन्य देशों में फैल रहा है। कोविड 19 नाम का यह वायरस अब तक 70 से ज़्यादा देशों में फैल चुका है। कोरोना के संक्रमण के बढ़ते ख़तरे को देखते हुए सावधानी बरतने की ज़रूरत है ताकि इसे फैलने से रोका जा सके।"""

# tokenize(input text, language code)
tokenize(hindi_text, "hi")
Out[4]:
['▁को',
 'रो',
 'ना',
 '▁वायरस',
 '▁',
 '(',
 'सी',
 'ओ',
 'वी',
 ')',
 '▁का',
 '▁संबंध',
 '▁वायरस',
 '▁के',
 '▁ऐसे',
 '▁परिवार',
 '▁से',
 '▁है',
 '▁जिसके',
 '▁संक्रमण',
 '▁से',
 '▁जुकाम',
 '▁से',
 '▁लेकर',
 '▁सांस',
 '▁लेने',
 '▁में',
 '▁तकलीफ',
 '▁जैसी',
 '▁समस्या',
 '▁हो',
 '▁सकती',
 '▁है',
 '।',
 '▁इस',
 '▁वायरस',
 '▁को',
 '▁पहले',
 '▁कभी',
 '▁नहीं',
 '▁देखा',
 '▁गया',
 '▁है',
 '।',
 '▁इस',
 '▁वायरस',
 '▁का',
 '▁संक्रमण',
 '▁दिसंबर',
 '▁में',
 '▁चीन',
 '▁के',
 '▁वु',
 'हान',
 '▁में',
 '▁शुरू',
 '▁हुआ',
 '▁था',
 '।',
 '▁डब्लू',
 'एच',
 'ओ',
 '▁के',
 '▁मुताबिक',
 '▁बुखार',
 ',',
 '▁खांसी',
 ',',
 '▁सांस',
 '▁लेने',
 '▁में',
 '▁तकलीफ',
 '▁इसके',
 '▁लक्षण',
 '▁हैं',
 '।',
 '▁अब',
 '▁तक',
 '▁इस',
 '▁वायरस',
 '▁को',
 '▁फैलने',
 '▁से',
 '▁रोकने',
 '▁वाला',
 '▁कोई',
 '▁टीका',
 '▁नहीं',
 '▁बना',
 '▁है',
 '।',
 '▁इसके',
 '▁संक्रमण',
 '▁के',
 '▁फलस्वरूप',
 '▁बुखार',
 ',',
 '▁जुकाम',
 ',',
 '▁सांस',
 '▁लेने',
 '▁में',
 '▁तकलीफ',
 ',',
 '▁नाक',
 '▁बहन',
 'ा',
 '▁और',
 '▁गले',
 '▁में',
 '▁खरा',
 'श',
 '▁जैसी',
 '▁समस्याएं',
 '▁उत्पन्न',
 '▁होती',
 '▁हैं',
 '।',
 '▁यह',
 '▁वायरस',
 '▁एक',
 '▁व्यक्ति',
 '▁से',
 '▁दूसरे',
 '▁व्यक्ति',
 '▁में',
 '▁फैलता',
 '▁है',
 '।',
 '▁इसलिए',
 '▁इसे',
 '▁लेकर',
 '▁बहुत',
 '▁सावधानी',
 '▁बरत',
 'ी',
 '▁जा',
 '▁रही',
 '▁है',
 '।',
 '▁यह',
 '▁वायरस',
 '▁दिसंबर',
 '▁में',
 '▁सबसे',
 '▁पहले',
 '▁चीन',
 '▁में',
 '▁पकड़',
 '▁में',
 '▁आया',
 '▁था',
 '।',
 '▁इसके',
 '▁दूसरे',
 '▁देशों',
 '▁में',
 '▁पहुंच',
 '▁जाने',
 '▁की',
 '▁आशंका',
 '▁जताई',
 '▁जा',
 '▁रही',
 '▁है',
 '।',
 '▁को',
 'रो',
 'ना',
 '▁से',
 '▁मिलते',
 '-',
 'जुलते',
 '▁वायरस',
 '▁खांसी',
 '▁और',
 '▁छींक',
 '▁से',
 '▁गिरने',
 '▁वाली',
 '▁बूंद',
 'ों',
 '▁के',
 '▁ज़रिए',
 '▁फैलते',
 '▁हैं',
 '।',
 '▁को',
 'रो',
 'ना',
 '▁वायरस',
 '▁अब',
 '▁चीन',
 '▁में',
 '▁उतन',
 'ी',
 '▁तीव्र',
 '▁गति',
 '▁से',
 '▁नहीं',
 '▁फ़ैल',
 '▁रहा',
 '▁है',
 '▁जितना',
 '▁दुनिया',
 '▁के',
 '▁अन्य',
 '▁देशों',
 '▁में',
 '▁फैल',
 '▁रहा',
 '▁है',
 '।',
 '▁को',
 'विड',
 '▁19',
 '▁नाम',
 '▁का',
 '▁यह',
 '▁वायरस',
 '▁अब',
 '▁तक',
 '▁70',
 '▁से',
 '▁ज़्यादा',
 '▁देशों',
 '▁में',
 '▁फैल',
 '▁चुका',
 '▁है',
 '।',
 '▁को',
 'रो',
 'ना',
 '▁के',
 '▁संक्रमण',
 '▁के',
 '▁बढ़ते',
 '▁ख़',
 'तर',
 'े',
 '▁को',
 '▁देखते',
 '▁हुए',
 '▁सावधानी',
 '▁बरतन',
 'े',
 '▁की',
 '▁ज़रूरत',
 '▁है',
 '▁ताकि',
 '▁इसे',
 '▁फैलने',
 '▁से',
 '▁रोका',
 '▁जा',
 '▁सके',
 '।']
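
Notice that the output contains subword pieces rather than whole words, and that '▁' marks the start of a new word (for example, कोरोना is split into '▁को', 'रो', 'ना'). As a rough sketch, assuming this marker convention holds for every token, the pieces can be glued back into readable text with plain string operations. The helper `detokenize` is hypothetical and not part of iNLTK.

```python
def detokenize(pieces):
    """Join subword pieces and turn the word-start marker back into spaces."""
    joined = "".join(pieces)
    return joined.replace("▁", " ").strip()


# The first few tokens from the output above
sample = ['▁को', 'रो', 'ना', '▁वायरस', '▁', '(', 'सी', 'ओ', 'वी', ')', '▁का', '▁संबंध']

print(detokenize(sample))
```

This is handy for checking that tokenization has not lost any characters from the input.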

Splitting input text into sentences

Indic NLP Library supports many basic text processing tasks, such as normalization and word-level tokenization. Sentence-level tokenization is the one I find most interesting, because different Indian languages follow different rules for marking where a sentence ends. Here is an example of how to use its sentence splitter:

In [14]:
!pip3 install indic-nlp-library
Successfully installed indic-nlp-library-0.71 morfessor-2.0.6
In [16]:
from indicnlp.tokenize import sentence_tokenize

hindi_text = """ कोरोना वायरस (सीओवी) का संबंध वायरस के ऐसे परिवार से है जिसके संक्रमण से जुकाम से लेकर सांस लेने में तकलीफ जैसी समस्या हो सकती है। इस वायरस को पहले कभी नहीं देखा गया है। इस वायरस का संक्रमण दिसंबर में चीन के वुहान में शुरू हुआ था। डब्लूएचओ के मुताबिक बुखार, खांसी, सांस लेने में तकलीफ इसके लक्षण हैं। अब तक इस वायरस को फैलने से रोकने वाला कोई टीका नहीं बना है।

इसके संक्रमण के फलस्वरूप बुखार, जुकाम, सांस लेने में तकलीफ, नाक बहना और गले में खराश जैसी समस्याएं उत्पन्न होती हैं। यह वायरस एक व्यक्ति से दूसरे व्यक्ति में फैलता है। इसलिए इसे लेकर बहुत सावधानी बरती जा रही है। यह वायरस दिसंबर में सबसे पहले चीन में पकड़ में आया था। इसके दूसरे देशों में पहुंच जाने की आशंका जताई जा रही है।

कोरोना से मिलते-जुलते वायरस खांसी और छींक से गिरने वाली बूंदों के ज़रिए फैलते हैं। कोरोना वायरस अब चीन में उतनी तीव्र गति से नहीं फ़ैल रहा है जितना दुनिया के अन्य देशों में फैल रहा है। कोविड 19 नाम का यह वायरस अब तक 70 से ज़्यादा देशों में फैल चुका है। कोरोना के संक्रमण के बढ़ते ख़तरे को देखते हुए सावधानी बरतने की ज़रूरत है ताकि इसे फैलने से रोका जा सके।"""

# Split the text into sentences; the language code "hi" stands for Hindi
sentences = sentence_tokenize.sentence_split(hindi_text, lang='hi')

# print the sentences
for t in sentences:
    print(t)
कोरोना वायरस (सीओवी) का संबंध वायरस के ऐसे परिवार से है जिसके संक्रमण से जुकाम से लेकर सांस लेने में तकलीफ जैसी समस्या हो सकती है।
इस वायरस को पहले कभी नहीं देखा गया है।
इस वायरस का संक्रमण दिसंबर में चीन के वुहान में शुरू हुआ था।
डब्लूएचओ के मुताबिक बुखार, खांसी, सांस लेने में तकलीफ इसके लक्षण हैं।
अब तक इस वायरस को फैलने से रोकने वाला कोई टीका नहीं बना है।
इसके संक्रमण के फलस्वरूप बुखार, जुकाम, सांस लेने में तकलीफ, नाक बहना और गले में खराश जैसी समस्याएं उत्पन्न होती हैं।
यह वायरस एक व्यक्ति से दूसरे व्यक्ति में फैलता है।
इसलिए इसे लेकर बहुत सावधानी बरती जा रही है।
यह वायरस दिसंबर में सबसे पहले चीन में पकड़ में आया था।
इसके दूसरे देशों में पहुंच जाने की आशंका जताई जा रही है।
कोरोना से मिलते-जुलते वायरस खांसी और छींक से गिरने वाली बूंदों के ज़रिए फैलते हैं।
कोरोना वायरस अब चीन में उतनी तीव्र गति से नहीं फ़ैल रहा है जितना दुनिया के अन्य देशों में फैल रहा है।
कोविड 19 नाम का यह वायरस अब तक 70 से ज़्यादा देशों में फैल चुका है।
कोरोना के संक्रमण के बढ़ते ख़तरे को देखते हुए सावधानी बरतने की ज़रूरत है ताकि इसे फैलने से रोका जा सके।
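
To see why sentence splitting depends on the language, note that Hindi ends a sentence with the danda (।) rather than a full stop. The following is a deliberately naive sketch that splits on the danda, question mark and exclamation mark using a regular expression. It is not how Indic NLP Library implements `sentence_split`, and the helper name `split_on_danda` is hypothetical.

```python
import re


def split_on_danda(text):
    """Split after a danda, question mark or exclamation mark followed by whitespace."""
    parts = re.split(r"(?<=[।?!])\s+", text.strip())
    return [part for part in parts if part]


sample = "इस वायरस को पहले कभी नहीं देखा गया है। यह वायरस एक व्यक्ति से दूसरे व्यक्ति में फैलता है।"

for sentence in split_on_danda(sample):
    print(sentence)
```

A rule this simple will break on abbreviations and on languages with other conventions, which is exactly the kind of detail a dedicated library takes care of.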

Resources You Will Ever Need