README.md
+24 -2 (24 additions & 2 deletions)
@@ -17,6 +17,7 @@ We have separate files describing the [datasets](./data/), [systems](./systems/)

 Version | Description
 ------- | -------------
+3 | Data fixes and addition of data from Spider and WikiSQL
 2 | Data with fixes for variables incorrectly defined in questions
 1 | Data used in the ACL 2018 paper

@@ -26,7 +27,7 @@ If you use this data in your work, please cite our ACL paper _and_ the appropria
 For example, in your paper you could write (using the BibTeX below):

 ```
-In this work, we use version 2 of the modified SQL datasets from \citet{data-advising}, based on \citet{data-academic,data-atis-original,data-geography-original,data-atis-geography-scholar,data-imdb-yelp,data-restaurants-logic,data-restaurants-original,data-restaurants}
+In this work, we use version 3 of the modified SQL datasets from \citet{data-advising}, based on \citet{data-academic,data-atis-original,data-geography-original,data-atis-geography-scholar,data-imdb-yelp,data-restaurants-logic,data-restaurants-original,data-restaurants,data-spider,data-wikisql}
 ```

 If you are only using one dataset, here are example citation commands:
+author = {Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev},
+title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
+booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
+year = {2018},
+location = {Brussels, Belgium},
+pages = {3911--3921},
+url = {http://aclweb.org/anthology/D18-1425},
+}
+
+@article{data-wikisql,
+author = {Victor Zhong, Caiming Xiong, and Richard Socher},
+title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
+year = {2017},
+journal = {CoRR},
+volume = {abs/1709.00103},
+}
 ```

 # Contributions
@@ -153,4 +174,5 @@ For some ideas of issues to address, see our list of [known issues](./known-issu

 # Acknowledgments

-This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of IBM.
+This material is based in part upon work supported by IBM under contract 4915012629.
+Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of IBM.
sentences/original | string | The question from the original dataset, before our simple tokenisation
+sentences/database | string | The name of the database this question is for [Spider]
+sentences/table-id | string | The name of the table this question is for [WikiSQL]
+sql-original | list of strings | The query from the original dataset, before our canonicalisation
+
+Also, there are a few caveats:
+
+- There is no query split.
+- For Spider the test set is currently not available here (it is being kept secret by the original creators).
+- We did automatic variable identification and spot checked some of it (if you find issues, let us know!).
+- We modified our canonicalisation script to convert the data and spot checked the output (again, if you find issues, let us know!).
+- For WikiSQL we substituted in the actual field names, which contain all sorts of characters, so the SQL is not always valid. This substitution also means the 'sql-original' column numbers may be incorrect for some of the examples as we merged based on the final SQL.
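
Review note: the two new per-sentence fields are dataset-specific (`database` for Spider, `table-id` for WikiSQL), so they can double as a cheap way to tell where an entry came from. The sketch below is only an illustration under assumptions: it assumes `sentences/database` means a `database` key inside each item of a `sentences` list, and the `example` entry, its values, and the `guess_source` helper are hypothetical rather than taken from the data files.

```python
def guess_source(entry):
    """Guess an entry's origin from the dataset-specific sentence fields.

    Assumed layout (not confirmed by the diff): entry["sentences"] is a list
    of dicts, and the new fields appear as keys of those dicts.
    """
    sentences = entry.get("sentences", [])
    if any("database" in sentence for sentence in sentences):
        return "spider"
    if any("table-id" in sentence for sentence in sentences):
        return "wikisql"
    return "other"


# Hypothetical entry, made up for illustration only.
example = {
    "sentences": [
        {
            "text": "how many rows are there ?",
            "original": "How many rows are there?",
            "database": "example_db",
        }
    ],
    "sql": ["SELECT COUNT( * ) FROM EXAMPLE AS EXAMPLEalias0 ;"],
    "sql-original": ["SELECT count(*) FROM example"],
}

print(guess_source(example))
```

Because the caveats say the WikiSQL `sql-original` column numbers may be off, a check like this could be used to skip that field for WikiSQL entries while still using it for Spider ones.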
data/advising.json
+3 -3 (3 additions & 3 deletions)
@@ -4186,7 +4186,7 @@
 }
 ],
 "sql": [
-"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.HAS_LAB = \"N\" AND ;"
+"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.HAS_LAB = \"N\" ;"
 ],
 "variables": [
 {
@@ -15067,7 +15067,7 @@
 }
 ],
 "sql": [
-"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" COURSEalias0.CREDITS = credit0 ;"
+"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.CREDITS = credit0 ;"
 ],
 "variables": [
 {
@@ -25745,7 +25745,7 @@
 }
 ],
 "sql": [
-"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER , PROGRAM_COURSEalias0.WORKLOAD FROM COURSE AS COURSEalias0 , PROGRAM_COURSE AS PROGRAM_COURSEalias0 WHERE COURSEalias0.DEPARTMENT LIKE \"%department0%\" AND PROGRAM_COURSEalias0.COURSE_ID = COURSEalias0.COURSE_ID AND PROGRAM_COURSEalias0.WORKLOAD < 3 ; ;"
+"SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER , PROGRAM_COURSEalias0.WORKLOAD FROM COURSE AS COURSEalias0 , PROGRAM_COURSE AS PROGRAM_COURSEalias0 WHERE COURSEalias0.DEPARTMENT LIKE \"%department0%\" AND PROGRAM_COURSEalias0.COURSE_ID = COURSEalias0.COURSE_ID AND PROGRAM_COURSEalias0.WORKLOAD < 3 ;"
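
Review note: two of the three data/advising.json fixes (a connector left dangling before the final `;`, and a doubled `;`) follow patterns that a simple lint pass could flag across the data files. The following is a minimal sketch, not part of this change: `find_sql_issues` and `DANGLING_CONNECTOR` are hypothetical names, the checks are plain text heuristics rather than a SQL parser, and the third fix (a missing `AND` between two conditions) is not detected by it.

```python
import re

# A boolean connector sitting directly before a semicolon, e.g. `... AND ;`.
DANGLING_CONNECTOR = re.compile(r"\b(AND|OR)\s*;", re.IGNORECASE)


def find_sql_issues(sql):
    """Return a list of heuristic warnings for one query string."""
    issues = []
    if DANGLING_CONNECTOR.search(sql):
        issues.append("dangling connector before ';'")
    # Counts every ';', so a semicolon inside a string literal would also trip this.
    if sql.count(";") > 1:
        issues.append("more than one ';'")
    return issues


# Queries taken from the hunks in this diff, with the JSON escaping removed.
dangling = (
    'SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 '
    'WHERE COURSEalias0.DEPARTMENT = "department0" AND COURSEalias0.HAS_LAB = "N" AND ;'
)
fixed = (
    'SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 '
    'WHERE COURSEalias0.DEPARTMENT = "department0" AND COURSEalias0.HAS_LAB = "N" ;'
)

print(find_sql_issues(dangling))
print(find_sql_issues(fixed))
```

Catching the missing-connector case would need real parsing of the `WHERE` clause, which is why it is left out of this sketch.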