TextSummarization

Generates the dataset for Google's Text Summarization (TextSum) code by Xin Pan and Peter Liu.

Repository Link: https://github.com/tensorflow/models/tree/master/research/textsum

The CNN and DailyMail stories can be obtained here: http://cs.nyu.edu/~kcho/DMQA/

Working:

TextSum requires every example to contain both an article key and an abstract key in order to train and decode.

Both articles and abstracts are tagged for sentence, paragraph and document start and end.
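
The tagging step can be sketched as below. This is a minimal illustration, not the code in convertdata.py: the function name tag_document is hypothetical, and the tag tokens (<d>, <p>, <s> and their closing forms) are assumed to match the ones TextSum expects.

```python
def tag_document(paragraphs):
    """Wrap a document in document, paragraph and sentence tags.

    `paragraphs` is a list of paragraphs, each a list of sentence strings.
    The tag tokens are an assumption about what TextSum expects.
    """
    tagged_paragraphs = []
    for sentences in paragraphs:
        # Mark the start and end of every sentence.
        tagged_sentences = " ".join(
            "<s> {} </s>".format(sentence.strip()) for sentence in sentences
        )
        # Mark the start and end of the paragraph.
        tagged_paragraphs.append("<p> {} </p>".format(tagged_sentences))
    # Mark the start and end of the whole document.
    return "<d> {} </d>".format(" ".join(tagged_paragraphs))


print(tag_document([["the cat sat .", "it purred ."]]))
# <d> <p> <s> the cat sat . </s> <s> it purred . </s> </p> </d>
```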

The abstract is built from all the @highlight entries in each story.
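
A rough sketch of how a story might be split into article text and highlights follows. It assumes each highlight is the first non-empty line after an @highlight marker, as in the downloaded .story files; split_story is a hypothetical helper, not necessarily how convertdata.py does it.

```python
def split_story(text):
    """Split raw story text into (article_lines, highlight_lines)."""
    article, highlights = [], []
    next_is_highlight = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith("@highlight"):
            # The next non-empty line holds the highlight text.
            next_is_highlight = True
        elif next_is_highlight:
            highlights.append(line)
            next_is_highlight = False
        else:
            article.append(line)
    return article, highlights


story = (
    "First sentence.\n\nSecond sentence.\n\n"
    "@highlight\n\nKey point one\n\n"
    "@highlight\n\nKey point two\n"
)
article, highlights = split_story(story)
abstract = " ".join(highlights)  # all highlights together form the abstract
```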

A vocabulary of 200000 words, including the UNK and PAD tokens, is generated.
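
One way to build such a vocabulary is to count word frequencies and keep the most common entries, reserving room for the special tokens. The sketch below is an assumption-laden illustration: build_vocab is a hypothetical name, the spellings <UNK> and <PAD> are assumed, and giving the special tokens a count of 0 is an arbitrary choice.

```python
from collections import Counter


def build_vocab(texts, max_size=200000, specials=("<UNK>", "<PAD>")):
    """Return a list of (word, count) pairs with at most `max_size` entries."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())  # whitespace tokenization (assumption)
    for token in specials:
        counts.pop(token, None)  # avoid listing a special token twice
    # Reserve slots for the special tokens, then keep the most frequent words.
    vocab = counts.most_common(max_size - len(specials))
    vocab.extend((token, 0) for token in specials)
    return vocab


vocab = build_vocab(["a b a", "b a c"], max_size=4)
```

With max_size=4, two slots go to the special tokens, so only the two most frequent words are kept.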

Usage:

CNN and DailyMail data should be present in %pwd%/cnn/stories and %pwd%/dailymail/stories

Run mkdir data in the present working directory.

You can generate both datasets or only one of them by passing one of the following arguments:

Run python convertdata.py with --both, --CNN, or --DM.
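
For illustration, the three flags could be mapped to the story directories named above roughly as follows. This is only a sketch of one possible argument handler: parse_sources is a hypothetical function and the actual parsing in convertdata.py may differ.

```python
import argparse


def parse_sources(argv):
    """Map --both / --CNN / --DM to the story directories to process."""
    parser = argparse.ArgumentParser(description="Choose which stories to convert.")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--both", action="store_true", help="CNN and DailyMail")
    group.add_argument("--CNN", action="store_true", help="CNN only")
    group.add_argument("--DM", action="store_true", help="DailyMail only")
    args = parser.parse_args(argv)

    dirs = []
    if args.both or args.CNN:
        dirs.append("cnn/stories")
    if args.both or args.DM:
        dirs.append("dailymail/stories")
    return dirs


print(parse_sources(["--both"]))
# ['cnn/stories', 'dailymail/stories']
```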