Hindi-English Code-Switching Dataset

This dataset contains text transcriptions and Hindi-English code-switched segment markings for audio recordings of two well-known Indian personalities: BK Shivani, a spiritual speaker, and Alia Bhatt, a film star. Each recording is taken from a YouTube video 7 to 15 minutes long.

The transcriptions were written manually into .txt files as running text (without any punctuation) and are not time-aligned with the audio. Words spoken in Hindi appear in Devanagari script, and words spoken in English appear in Roman script.
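
Because the two languages are written in different scripts, each word in a transcription can be assigned a language from its characters alone. The following is a minimal sketch of that idea, not part of the dataset or of the authors' tooling. The function names and the sample sentence are hypothetical, and the sketch assumes a word is Hindi whenever it contains at least one character from the Unicode Devanagari block (U+0900 to U+097F).

```python
def word_language(word):
    """Return 'H' if the word contains a Devanagari character, else 'E'.

    Assumption: any character in the Unicode Devanagari block
    (U+0900..U+097F) marks the whole word as Hindi.
    """
    for ch in word:
        if "\u0900" <= ch <= "\u097F":
            return "H"
    return "E"


def tag_words(text):
    """Split running text on whitespace and tag each word with 'H' or 'E'."""
    return [(word, word_language(word)) for word in text.split()]


# Hypothetical sample sentence, not taken from the dataset.
sample = "मैं really खुश हूँ"
for word, lang in tag_words(sample):
    print(lang, word)
```

These word-level tags are not the same as the H, H(E), E and E(H) segment labels, which also depend on the syntax of the surrounding segment.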

Code-switched segments were marked in two tiers of a Praat TextGrid file [1] and are time-aligned with the audio. (Praat is a free software package for the scientific analysis of speech in phonetics [2].) Each segment carries one of four labels:

H - segments with all Hindi words in Hindi syntax
H(E) - segments with some English words in Hindi syntax
E - segments with all English words in English syntax
E(H) - segments with some Hindi words in English syntax

Tier 1 has each of these regions marked separately, while tier 2 is a reduced form in which H and H(E) are combined as H, and E and E(H) are combined as E. In addition, silences longer than a threshold duration are marked S.
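
The relationship between the two tiers can be sketched as a label mapping. The example below is illustrative only: it assumes the intervals have already been read from a TextGrid into (start, end, label) tuples, that S is kept unchanged in the reduced tier, and that adjacent intervals with the same reduced label are merged. The names TIER1_TO_TIER2 and reduce_tier are hypothetical, and the tier 2 supplied with the dataset may differ in these details.

```python
# Mapping from tier 1 labels to the reduced tier 2 labels described above.
# Keeping "S" unchanged is an assumption.
TIER1_TO_TIER2 = {"H": "H", "H(E)": "H", "E": "E", "E(H)": "E", "S": "S"}


def reduce_tier(intervals):
    """Collapse tier 1 intervals into tier 2 style intervals.

    intervals: list of (start, end, label) tuples sorted by start time.
    Contiguous intervals that share a reduced label are merged
    (an assumption about how the reduced tier is built).
    """
    reduced = []
    for start, end, label in intervals:
        new_label = TIER1_TO_TIER2[label.strip()]
        if reduced and reduced[-1][2] == new_label and reduced[-1][1] == start:
            # Extend the previous interval instead of starting a new one.
            reduced[-1] = (reduced[-1][0], end, new_label)
        else:
            reduced.append((start, end, new_label))
    return reduced


# Hypothetical tier 1 intervals (times in seconds), not taken from the dataset.
tier1 = [
    (0.0, 1.5, "H"),
    (1.5, 2.0, "H(E)"),
    (2.0, 3.0, "S"),
    (3.0, 4.0, "E(H)"),
    (4.0, 5.0, "E"),
]
print(reduce_tier(tier1))
```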

The videos corresponding to each of the transcription files in the dataset can be found on YouTube at the following links:

Spiritual lectures by Brahmakumari Shivani didi (Mount Abu) -
- Destiny_is_YOUR_CHOICE_BK_Shivani.txt - YouTube Video
- Is_Everything_PreDestined_BK_Shivani.txt - YouTube Video
- Live_The_Life_You_Desire_BK_Shivani.txt - YouTube Video
- Mistakes_Need_Love_BK_Shivani.txt - YouTube Video
- Strengthen_My_Relationships_BK_Shivani.txt - YouTube Video
Interview on a film-release by Alia Bhatt -
- Alia1.txt - YouTube Video
- Alia2.txt - YouTube Video

Organisation of the Dataset
The dataset is organised as follows:

‘Transcriptions’ folder - contains all the .txt transcription files
‘Textgrids’ folder - contains all the .Textgrid files with code-switched segments

This dataset is made available for non-commercial research purposes only. Please cite the following paper if you use this dataset in your research:

P. Rao, M. Pandya, K. Sabu, K. Kumar and N. Bondale, "A Study of Lexical and Prosodic Cues to Segmentation in a Hindi-English Code-switched Discourse", Proc. of Interspeech, Sep 2018, Hyderabad, India.

To request a copy of the dataset, please contact us at:
prao@ee.iitb.ac.in
(Prof. Preeti Rao, Dept of EE, IITB)

References
[1] http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html
[2] http://www.fon.hum.uva.nl/praat/
