Skip to content

Commit 3a3dc21

Browse files
committed
Include datasets in README.md.
1 parent 32f9b99 commit 3a3dc21

File tree

1 file changed

+13
-1
lines changed

1 file changed

+13
-1
lines changed

README.md

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -285,8 +285,20 @@ If you have installed the source code, you can customize Co-STORM based on your
285285
1. Co-STORM introduces multiple LLM agent types (i.e. Co-STORM experts and Moderator). LLM agent interface is defined in `knowledge_storm/interface.py` , while its implementation is instantiated in `knowledge_storm/collaborative_storm/modules/co_storm_agents.py`. Different LLM agent policies can be customized.
286286
2. Co-STORM introduces a collaborative discourse protocol, with its core function centered on turn policy management. We provide an example implementation of turn policy management through `DiscourseManager` in `knowledge_storm/collaborative_storm/engine.py`. It can be customized and further improved.
287287
288+
## Datasets
289+
To facilitate the study of automatic knowledge curation and complex information seeking, our project releases the following datasets:
288290
289-
## Replicate Replicate STORM & Co-STORM paper result
291+
### FreshWiki
292+
The FreshWiki Dataset is a collection of 100 high-quality Wikipedia articles focusing on the most-edited pages from February 2022 to September 2023. See Section 2.1 in [STORM paper](https://arxiv.org/abs/2402.14207) for more details.
293+
294+
You can download the dataset from [huggingface](https://huggingface.co/datasets/EchoShao8899/FreshWiki) directly. To ease the data contamination issue, we archive the [source code](https://github.com/stanford-oval/storm/tree/NAACL-2024-code-backup/FreshWiki) for the data construction pipeline that can be repeated at future dates.
295+
296+
### WildSeek
297+
To study users’ interests in complex information seeking tasks in the wild, we utilized data collected from the web research preview to create the WildSeek dataset. We downsampled the data to ensure the diversity of the topics and the quality of the data. Each data point is a pair comprising a topic and the user’s goal for conducting deep search on the topic. For more details, please refer to Section 2.2 and Appendix A of [Co-STORM paper](https://www.arxiv.org/abs/2408.15232).
298+
299+
The WildSeek dataset is available [here](https://huggingface.co/datasets/YuchengJiang/WildSeek).
300+
301+
## Replicate STORM & Co-STORM paper result
290302
291303
For STORM paper experiments, please switch to the branch `NAACL-2024-code-backup` [here](https://github.com/stanford-oval/storm/tree/NAACL-2024-code-backup).
292304

0 commit comments

Comments
 (0)