Commit 953c28a

Author: Jonathan Kummerfeld
Merge pull request #33 from jkkummerfeld/data-development
Data development
2 parents 5da2527 + e76b98b commit 953c28a

23 files changed, +517575 -1614 lines changed

LICENSE.md

Lines changed: 1 addition & 1 deletion
@@ -7,4 +7,4 @@ Files | License
 `systems/baseline-template` | [Apache License 2.0](./systems/baseline-template/LICENSE.txt)
 `systems/sequence-to-sequence` | [Apache License 2.0](./systems/sequence-to-sequence/LICENSE.txt)
 `data/advising*` | [CC-BY-4.0](./data/advising-LICENSE.txt)
-
+`data/wikisql*` | [BSD](./data/wikisql-LICENSE.txt)

README.md

Lines changed: 24 additions & 2 deletions
@@ -17,6 +17,7 @@ We have separate files describing the [datasets](./data/), [systems](./systems/)

 Version | Description
 ------- | -------------
+3 | Data fixes and addition of data from Spider and WikiSQL
 2 | Data with fixes for variables incorrectly defined in questions
 1 | Data used in the ACL 2018 paper

@@ -26,7 +27,7 @@ If you use this data in your work, please cite our ACL paper _and_ the appropria
 For example, in your paper you could write (using the BibTeX below):

 ```
-In this work, we use version 2 of the modified SQL datasets from \citet{data-advising}, based on \citet{data-academic,data-atis-original,data-geography-original,data-atis-geography-scholar,data-imdb-yelp,data-restaurants-logic,data-restaurants-original,data-restaurants}
+In this work, we use version 3 of the modified SQL datasets from \citet{data-advising}, based on \citet{data-academic,data-atis-original,data-geography-original,data-atis-geography-scholar,data-imdb-yelp,data-restaurants-logic,data-restaurants-original,data-restaurants,data-spider,data-wikisql}
 ```

 If you are only using one dataset, here are example citation commands:
@@ -39,8 +40,10 @@ ATIS | `\citet{data-advising,data-atis-original,data-atis-geography-scho
 Geography | `\citet{data-advising,data-geography-original,data-atis-geography-scholar}`
 Restaurants | `\citet{data-advising,data-restaurants-logic,data-restaurants-original,data-restaurants}`
 Scholar | `\citet{data-advising,data-atis-geography-scholar}`
+Spider | `\citet{data-advising,data-spider}`
 IMDB | `\citet{data-advising,data-imdb-yelp}`
 Yelp | `\citet{data-advising,data-imdb-yelp}`
+WikiSQL | `\citet{data-advising,data-wikisql}`

 ```TeX
 @InProceedings{data-sql-advising,
@@ -140,6 +143,24 @@ Yelp | `\citet{data-advising,data-imdb-yelp}`
 pages = {59--76},
 url = {https://doi.org/10.1007/978-3-642-45260-4_5},
 }
+
+@InProceedings{data-spider,
+author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
+title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
+booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
+year = {2018},
+location = {Brussels, Belgium},
+pages = {3911--3921},
+url = {http://aclweb.org/anthology/D18-1425},
+}
+
+@article{data-wikisql,
+author = {Victor Zhong and Caiming Xiong and Richard Socher},
+title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
+year = {2017},
+journal = {CoRR},
+volume = {abs/1709.00103},
+}
 ```

 # Contributions
@@ -153,4 +174,5 @@ For some ideas of issues to address, see our list of [known issues](./known-issu

 # Acknowledgments

-This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of IBM.
+This material is based in part upon work supported by IBM under contract 4915012629.
+Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of IBM.

data/READ-history.md

Lines changed: 12 additions & 0 deletions
@@ -11,7 +11,9 @@ geography | [Iyer et al., 2017](http://aclweb.org/anthology/P/P17/P17-1089.pd
 imdb | [Yaghmazadeh et al., 2017](http://doi.org/10.1145/3133887) | [UT](https://drive.google.com/drive/folders/0B-2uoWxAwJGKY09kaEtTZU1nTWM)
 restaurants | [Popescu et al., 2003](https://doi.org/10.1007/978-3-642-45260-4_5) | [Trento](https://ikernels-portal.disi.unitn.it/repository/semmap/)
 scholar | [Iyer et al., 2017](http://aclweb.org/anthology/P/P17/P17-1089.pdf) | [UW](https://github.com/sriniiyer/nl2sql/tree/master/data)
+spider | [Yu et al., 2018](http://aclweb.org/anthology/D18-1425) | [Yale](https://yale-lily.github.io/spider)
 yelp | [Yaghmazadeh et al., 2017](http://doi.org/10.1145/3133887) | [UT](https://drive.google.com/drive/folders/0B-2uoWxAwJGKY09kaEtTZU1nTWM)
+wikisql | [Zhong et al., 2017](https://arxiv.org/pdf/1709.00103.pdf) | [Salesforce](https://github.com/salesforce/WikiSQL)

 ## academic

@@ -56,6 +58,11 @@ We have corrected some minor issues in the data:

 Constructed at UW in 2017

+## spider
+
+1. Combination of data from this repository (1,659 queries) and new data (8,034 queries) across a large set of tables.
+2. SQL canonicalised and variables detected automatically by us
+
 ## yelp and imdb

 Constructed at UT Austin in 2017
@@ -68,3 +75,8 @@ Note - in the imdb dataset there are some cases where multiple SQL queries are p
 "select actor_0.nationality from actor as actor_0 where actor_0.name = \" Ben Affleck \" "
 ```

+## wikiSQL
+
+1. Data collected by Salesforce
+2. SQL converted to our format and duplicate queries detected by us
+

data/README.md

Lines changed: 19 additions & 1 deletion
@@ -14,13 +14,14 @@ For each dataset we provide:
 - `*-fields.txt`, a list of fields in the database
 - `*-schema.csv`, key information about each database field

-Four of the databases are not included in the repository either for size or licensing reasons.
+Some of the databases are not included in the repository either for size or licensing reasons.
 They can be found as follows:

 Dataset | Database
 -------- | ----------
 Academic (MAS), IMDB, Yelp | [https://drive.google.com/drive/folders/0B-2uoWxAwJGKY09kaEtTZU1nTWM](https://drive.google.com/drive/folders/0B-2uoWxAwJGKY09kaEtTZU1nTWM)
 Scholar | [https://drive.google.com/file/d/0Bw5kFkY8RRXYRXdYYlhfdXRlTVk](https://drive.google.com/file/d/0Bw5kFkY8RRXYRXdYYlhfdXRlTVk)
+Spider | [https://yale-lily.github.io/spider](https://yale-lily.github.io/spider)

 For more information about the sources of data see the [READ-history.md](./READ-history.md) file.

@@ -50,6 +51,23 @@ variables/example | string | An example value that could fill
 variables/name | string | The variable name
 variables/type | string | Dataset specific type

+For WikiSQL and Spider we have a few additional fields:
+
+Symbol | Type | Meaning
+------------------ | ----------------- | -----------------------------
+sentences/original | string | The question from the original dataset, before our simple tokenisation
+sentences/database | string | The name of the database this question is for [Spider]
+sentences/table-id | string | The name of the table this question is for [WikiSQL]
+sql-original | list of strings | The query from the original dataset, before our canonicalisation
+
+Also, there are a few caveats:
+
+- There is no query split.
+- For Spider the test set is currently not available here (it is being kept secret by the original creators).
+- We did automatic variable identification and spot checked some of it (if you find issues, let us know!).
+- We modified our canonicalisation script to convert the data and spot checked the output (again, if you find issues, let us know!).
+- For WikiSQL we substituted in the actual field names, which contain all sorts of characters, so the SQL is not always valid. This substitution also means the 'sql-original' column numbers may be incorrect for some of the examples as we merged based on the final SQL.
+
 Example:

 ```
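
As a rough illustration of the additional WikiSQL and Spider fields listed in the `data/README.md` changes above, the following Python sketch collects them from a single entry. The function name `extra_fields` and the sample record are invented for this example, and the nesting (a `sentences` list of objects next to a top-level `sql-original` list) is an assumption inferred from the `sentences/...` naming in the table, so check the real files before relying on it.

```python
from typing import Any, Dict, List


def extra_fields(entry: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the WikiSQL/Spider-specific fields for each sentence in an entry.

    Missing keys yield None, because 'database' is documented for Spider only
    and 'table-id' for WikiSQL only.
    """
    sql_original = entry.get("sql-original", [])
    rows = []
    for sentence in entry.get("sentences", []):
        rows.append({
            "original": sentence.get("original"),
            "database": sentence.get("database"),
            "table-id": sentence.get("table-id"),
            "sql-original": sql_original,
        })
    return rows


# Hypothetical Spider-style entry, invented for this example (assumed layout).
sample_entry = {
    "sentences": [
        {"original": "How many singers are there?", "database": "example_db"}
    ],
    "sql": ["SELECT COUNT( * ) FROM singer AS singeralias0 ;"],
    "sql-original": ["SELECT count(*) FROM singer"],
}

for row in extra_fields(sample_entry):
    print(row["database"], "|", row["original"])
```

Using `.get` throughout keeps the same helper usable for both datasets, since each one carries only one of the two dataset-specific keys.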

data/advising-schema.csv

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-Table Name, Field Name, Is Primary Key, Is Foreign Key, Type
+Table Name, Field Name, Type, Is Foreign Key, Is Primary Key
 AREA, course_id, int(11), YES, -, NULL, -
 AREA, area, varchar(30), YES, -, NULL, -
 -, -, -, -, -, -, -

data/advising.json

Lines changed: 3 additions & 3 deletions
@@ -4186,7 +4186,7 @@
 }
 ],
 "sql": [
-"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.HAS_LAB = \"N\" AND ;"
+"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.HAS_LAB = \"N\" ;"
 ],
 "variables": [
 {
@@ -15067,7 +15067,7 @@
 }
 ],
 "sql": [
-"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" COURSEalias0.CREDITS = credit0 ;"
+"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.CREDITS = credit0 ;"
 ],
 "variables": [
 {
@@ -25745,7 +25745,7 @@
 }
 ],
 "sql": [
-"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER , PROGRAM_COURSEalias0.WORKLOAD FROM COURSE AS COURSEalias0 , PROGRAM_COURSE AS PROGRAM_COURSEalias0 WHERE COURSEalias0.DEPARTMENT LIKE \"%department0%\" AND PROGRAM_COURSEalias0.COURSE_ID = COURSEalias0.COURSE_ID AND PROGRAM_COURSEalias0.WORKLOAD < 3 ; ;"
+"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER , PROGRAM_COURSEalias0.WORKLOAD FROM COURSE AS COURSEalias0 , PROGRAM_COURSE AS PROGRAM_COURSEalias0 WHERE COURSEalias0.DEPARTMENT LIKE \"%department0%\" AND PROGRAM_COURSEalias0.COURSE_ID = COURSEalias0.COURSE_ID AND PROGRAM_COURSEalias0.WORKLOAD < 3 ;"
 ],
 "variables": [
 {
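
The three `data/advising.json` fixes above (a dangling `AND` before `;`, a missing `AND` between two conditions, and a doubled `;`) are the kind of slip a small heuristic check might catch. The Python sketch below is one possible approach under stated assumptions and is not part of this commit: the function name `sql_problems` and the token lists are hypothetical, it assumes the whitespace-separated token style used in these files, and it will produce false positives on queries with arithmetic, `BETWEEN`, or subqueries on the right-hand side of a comparison.

```python
import re
from typing import List

COMPARISON_OPS = {"=", "<", ">", "<=", ">=", "!=", "<>", "LIKE"}
# Tokens that may legitimately follow a completed comparison (not exhaustive).
AFTER_COMPARISON = {"AND", "OR", ";", ")", "GROUP", "ORDER", "LIMIT", "HAVING", "UNION"}


def sql_problems(query: str) -> List[str]:
    """Return heuristic warnings for a whitespace-tokenised SQL string."""
    problems = []
    # Collapse quoted strings to one token so spaces inside them are ignored.
    tokens = re.sub(r'"[^"]*"', '"STR"', query).split()
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok.upper() in {"AND", "OR"} and nxt == ";":
            problems.append("dangling boolean operator before ';'")
        if tok == ";" and nxt == ";":
            problems.append("repeated ';'")
        # After `lhs OP rhs`, the next token should join or close the clause.
        if tok.upper() in COMPARISON_OPS and i + 2 < len(tokens):
            if tokens[i + 2].upper() not in AFTER_COMPARISON:
                problems.append("comparison not followed by AND/OR or a clause end")
    return problems


# Shortened version of the first query fixed in this commit (unescaped quotes).
broken = (
    'SELECT COURSEalias0.NAME FROM COURSE AS COURSEalias0 '
    'WHERE COURSEalias0.DEPARTMENT = "department0" AND ;'
)
print(sql_problems(broken))
```

A check like this only flags candidates for manual review; it does not parse SQL, so a clean result is not a guarantee that a query is valid.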
