Hindi-English Code-Switching Dataset
The transcriptions were written manually into .txt files as running text (without any punctuation) and are not time-aligned with the audio. Words spoken in Hindi appear in Devanagari script, and words spoken in English appear in Roman script.
Code-switched segments are marked in two tiers of a Praat TextGrid file [1] and are time-aligned with the audio. (Praat is a free software package for the scientific analysis of speech in phonetics [2].) The segment labels indicate four kinds of code-switching:
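Because the script of each word encodes its language, a transcription can be tagged word by word without any dictionary. The following is a minimal sketch, assuming the files are UTF-8 text with whitespace-separated words; the function names `script_of` and `tag_tokens` are hypothetical and not part of the dataset.

```python
def script_of(token: str) -> str:
    """Guess the language of a token from its script.

    Returns "hindi" if the token contains any character from the
    Unicode Devanagari block (U+0900 to U+097F), "english" if it
    contains an ASCII letter, and "other" otherwise (e.g. digits).
    """
    if any("\u0900" <= ch <= "\u097f" for ch in token):
        return "hindi"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "english"
    return "other"


def tag_tokens(text: str) -> list[tuple[str, str]]:
    """Split running text on whitespace and tag each token by script."""
    return [(tok, script_of(tok)) for tok in text.split()]


# Illustrative sentence, not taken from the dataset.
for token, lang in tag_tokens("मैं office जा रहा हूँ"):
    print(token, lang)
```

Since the transcriptions are not time-aligned, these word-level tags cannot be mapped onto the audio directly; they only describe the text.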
- H - segments with all Hindi words in Hindi syntax
- H(E) - segments with some English words in Hindi syntax
- E - segments with all English words in English syntax
- E(H) - segments with some Hindi words in English syntax
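As an illustration of how these four labels might be used, the sketch below totals the time spent in each category. It assumes the intervals have already been read from a TextGrid tier into `(start, end, label)` tuples with times in seconds, and that the label strings are exactly `H`, `H(E)`, `E`, and `E(H)`; the actual tier names and label spellings in the files should be checked first. The helper names are hypothetical.

```python
from collections import defaultdict

# Assumed label strings, taken from the list above.
LABELS = ("H", "H(E)", "E", "E(H)")


def duration_by_label(intervals):
    """Sum the duration (end - start) of the intervals for each known label.

    Intervals with an empty or unrecognised label are ignored.
    """
    totals = defaultdict(float)
    for start, end, label in intervals:
        label = label.strip()
        if label in LABELS:
            totals[label] += end - start
    return dict(totals)


def syntax_language(label):
    """Return the language whose syntax the segment follows."""
    return "Hindi" if label.strip().startswith("H") else "English"


# Made-up intervals for illustration only.
example = [
    (0.0, 1.5, "H"),
    (1.5, 2.0, "H(E)"),
    (2.0, 4.0, "E"),
    (4.0, 4.5, ""),      # unlabelled gap, skipped
    (4.5, 5.0, "H"),
]
print(duration_by_label(example))
```

Grouping `H` with `H(E)` and `E` with `E(H)` via `syntax_language` gives the split by syntax alone, regardless of inserted words from the other language.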
The videos corresponding to each of the transcription files in the dataset can be found on YouTube at the following links:
- Spiritual lectures by Brahmakumari Shivani didi (Mount Abu):
  - Destiny_is_YOUR_CHOICE_BK_Shivani.txt - YouTube Video
  - Is_Everything_PreDestined_BK_Shivani.txt - YouTube Video
  - Live_The_Life_You_Desire_BK_Shivani.txt - YouTube Video
  - Mistakes_Need_Love_BK_Shivani.txt - YouTube Video
  - Strengthen_My_Relationships_BK_Shivani.txt - YouTube Video
- Interview on a film release by Alia Bhatt:
  - Alia1.txt - YouTube Video
  - Alia2.txt - YouTube Video
The dataset is organised as follows:
- ‘Transcriptions’ folder - contains all the .txt transcription files
- ‘Textgrids’ folder - contains all the .Textgrid files with code-switched segments
P. Rao, M. Pandya, K. Sabu, K. Kumar and N. Bondale "A Study of Lexical and Prosodic Cues to Segmentation in a Hindi-English Code-switched Discourse", Proc. of Interspeech, Sep 2018, Hyderabad, India.
For a copy of the dataset, please contact us at:
prao@ee.iitb.ac.in
(Prof. Preeti Rao, Dept of EE, IITB)
References
[1] http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html
[2] http://www.fon.hum.uva.nl/praat/