Reddit is a holy grail of training data for dialogue systems, but the sheer size of the available dumps makes it hard to start working with them right away. To get a head start on using Reddit threads as training data for chatbots, I wrote some scripts that convert Reddit dumps into simple .txt files. These can be imported into any text data-loading pipeline without boilerplate code for reading or preprocessing the raw data. You can find the code in the bsantraigi/RedditDialogue repo.

Downloading the Reddit Dumps

You would first need to download the Reddit Dumps for comments and posts from

Thanks to /u/Stuck_In_the_Matrix for gathering all this data together.

How to run

  • Create a data folder
  • Put the RS (posts) and RC (comments) file pairs for the same months in it
  • To generate train and test dialogues, you can use something like the following:
python --file_tag 2017-03 --output_folder sample_data/train --n_posts 500
python --file_tag 2011-03 --output_folder sample_data/test --n_posts 500
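
Before running the commands above, it can help to check that every month in the data folder really has both files. The snippet below is only a rough sketch and not part of the repo: `find_dump_pairs` is a hypothetical helper, and it assumes the dumps keep names of the form RS_YYYY-MM and RC_YYYY-MM (with any extension), which may not match your files.

```python
import re
from pathlib import Path

# Assumption: posts are named RS_YYYY-MM.<ext> and comments RC_YYYY-MM.<ext>.
DUMP_NAME = re.compile(r"^(RS|RC)_(\d{4}-\d{2})\b")


def find_dump_pairs(data_dir):
    """Hypothetical helper: return (complete, incomplete) month tags in data_dir.

    A tag is complete when both an RS and an RC file exist for that month.
    """
    seen = {}
    for path in Path(data_dir).iterdir():
        match = DUMP_NAME.match(path.name)
        if match and path.is_file():
            kind, tag = match.groups()
            seen.setdefault(tag, set()).add(kind)
    both = {"RS", "RC"}
    complete = sorted(tag for tag, kinds in seen.items() if kinds == both)
    incomplete = sorted(tag for tag, kinds in seen.items() if kinds != both)
    return complete, incomplete
```

Calling `find_dump_pairs("data")` returns two sorted lists of month tags. Any tag in the first list should be usable as a `--file_tag` value; tags in the second list are missing either the RS or the RC file.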

Sample data

The sample data folder contains dialogues extracted from the dumps for the following two months:

  • ‘train’ from 2017-03
  • ‘test’ from 2011-03