Commit
Various Changes
Update main page.
Remove the pile sub page.
Update readme.
Update openwebtext2 sub page.
researcher2 committed Jan 22, 2021
1 parent e5d1f8d commit acfaf15
Showing 4 changed files with 50 additions and 38 deletions.
66 changes: 38 additions & 28 deletions README.md
@@ -2,51 +2,61 @@

This is the new website for EleutherAI based on Hugo, a static site generator. The content should correspond to the existing Google Sites website, but with an added blog and other features. Please familiarize yourself with the basics of working with Hugo before you start using it.


## How it works?
## Setup
1. [Install Hugo](https://gohugo.io/getting-started/installing/).
2. Clone this repository (and make sure you are on the `master` branch).
3. Fetch the git submodules, which hold the generated websites in the public folders (`/public`, `/public-blog`, etc.): run `git submodule update --init`, then make sure each of them is set to the `master` branch.
4. Now you can try running Hugo locally: `hugo server -D`
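
The setup steps above can be sketched as a small shell helper. This is a minimal sketch and not part of the repository: the `run` wrapper, the `DRY_RUN` variable, and the `setup_site` function are hypothetical names introduced for illustration, and it assumes the repository has already been cloned and that you are in its root directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): wraps the setup steps above.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumes the current directory is the root of the cloned repository.
setup_site() {
  run git checkout master                          # step 2: main repo on master
  run git submodule update --init                  # step 3: fetch public, public-blog, ...
  run git submodule foreach 'git checkout master'  # step 3: put each submodule on master
  run hugo server -D                               # step 4: local server, drafts included
}
```

Calling `DRY_RUN=1 setup_site` prints the four commands without executing them, which is a cheap way to review them first; calling `setup_site` on its own runs them in order.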

If everything is working at this point, feel free to start working on the website.
## Project Layout

Once you are done with your work, you can publish the changes. To do that, push the git changes both for the repo and for the `/public` folder, which is set up as a git submodule holding the website's HTML content.
| Directory | Description |
| -----------: | ----------- |
| `content` | Underlying content for the main site. |
| `content-blog` | Underlying content for the blog. |
| `static/images` | Images for both sites. |
| `themes/eai-theme` | We use a single theme for both the main site and the blog. |
| `public` | Build for the main site. |
| `public-blog` | Build for the blog. |

The easiest way to do it is to run the deploy script:
`./deploy.sh` and then just commit the changes from the main repo.
## How to display 2 containers that are horizontally aligned?

If that doesn't work, you can do it manually:
1. Add an empty markdown header with the class `content-block`: `## ## {class="content-block"}`
2. After that line, add the 2 containers as 2 items of a list. The CSS makes the first `<ul>` list below a content-block header display its items horizontally (only the first one; any other lists are displayed as usual).

1. go to public folder `cd public`
2. commit the changes and push them. `git add`, `git commit -m [commit name for submodule]`, `git push`
3. go back to the main repo `cd ..` and commit+push your changes there too `git add`, `git commit -m [commit name for project]`, `git push`
## Dev Environment

***Note: depending on your user settings, you might not have permission to make changes in the `/public` folder. In that case you can still perform all the previous steps with the `sudo` command.***
To run the development server on localhost for the main site:

So it will be `sudo ./deploy.sh`, `sudo git commit`, etc.
`hugo server -D`

## Blog
To use the blog, the instructions are similar to those in the previous section, with a few differences. The blog markdown content is served from `content-blog`, the submodule repo is `public-blog`, and it uses a different config file (`config-blog.toml`) and a different deploy script (`deploy_blog.sh`).
To load the blog instead:

1. change the content in content-blog
2. when deploying run: `./deploy_blog.sh`
3. commit/push the changes for the main repo
`hugo server -D --config config-blog.toml`

### Update: there is now a script that should do all of this for you. You just need to run:
To bind on another IP apart from localhost and change the baseURL (ensuring the links work):

`./run_all.sh`
`hugo server --bind=BIND-IP --baseUrl=IP-OR-DOMAIN -D`
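
As an illustration of filling in the two placeholders in the last command, the sketch below builds the command line from a bind address and a base URL. The helper name `hugo_dev_cmd` and the example address `192.168.1.10` are assumptions made up for this example; `1313` is Hugo's default server port. The function only prints the command, so it can be inspected before being run.

```shell
#!/bin/sh
# Hypothetical helper: prints the dev-server command shown above
# for a given bind address and base URL (both supplied by the caller).
hugo_dev_cmd() {
  bind_ip="$1"     # address to listen on, e.g. 0.0.0.0 for all interfaces
  base_url="$2"    # URL that generated links should point at
  printf 'hugo server --bind=%s --baseUrl=%s -D\n' "$bind_ip" "$base_url"
}
```

For example, `hugo_dev_cmd 0.0.0.0 http://192.168.1.10:1313/` prints `hugo server --bind=0.0.0.0 --baseUrl=http://192.168.1.10:1313/ -D`. Binding to `0.0.0.0` listens on all interfaces, so the base URL should be an address that other machines can actually reach.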

***(Before running the script, make sure all git submodules are on the `master` branch; otherwise it won't push.)***
If everything is working at this point, feel free to start working on the website. Once you are happy with the changes, perform the build as explained below.

### It will generate and deploy all the sites (both the website and the blog).
## Building And Pushing

We are using submodules for the site builds (`public` and `public-blog`), so these need to be built and pushed separately from the underlying template and content changes.

We have created build scripts to make this process easier:

**Main Site:** `./deploy.sh`
**Blog:** `./deploy_blog.sh`
**Both:** `./run_all.sh`

Afterwards you can separately push your underlying changes.

***Note: depending on your user settings, you might not have permission to make changes in the `/public` folder. In that case you can still perform all the previous steps with the `sudo` command.***

So it will be `sudo ./deploy.sh`, `sudo git commit`, etc.

***(Before running the script, make sure all git submodules are on the `master` branch; otherwise it won't push.)***
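
The branch requirement above can be checked before deploying. The following is a minimal sketch under stated assumptions: `all_on_master` is a hypothetical function name, not something provided by this repository, and it simply reads branch names from standard input (one per line) and fails as soon as one of them is not `master`.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of this repo): succeeds only if
# every branch name read from stdin, one per line, is "master".
all_on_master() {
  while IFS= read -r branch; do
    [ "$branch" = "master" ] || return 1
  done
  return 0
}

# Assumed usage from the repository root. A submodule in detached-HEAD state
# reports "HEAD" instead of a branch name, so the check fails for it too:
#   git submodule --quiet foreach 'git rev-parse --abbrev-ref HEAD' | all_on_master && ./run_all.sh
```

Chaining the check with `&&` means the deploy script only starts when every submodule reports `master`, so a branch mix-up is caught before anything is built.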

## Editing content
The theme and content structure should be similar to standard Hugo projects. Content for the main pages is in the `/content` folder. The blog is in `/content/blog`.

`content/projects` holds the markdown content for the project pages (GPT-NEO, The Pile); `content/projects-intros` holds small chunks of the project content that are displayed on the home page.

### How to display 2 containers that are horizontally aligned?

1. Add an empty markdown header with the class `content-block`: `## ## {class="content-block"}`
2. After that line, add the 2 containers as 2 items of a list. The CSS makes the first `<ul>` list below a content-block header display its items horizontally (only the first one; any other lists are displayed as usual).
6 changes: 3 additions & 3 deletions content/projects-intros/open-web-text2.md
@@ -4,8 +4,8 @@ date: 2019-04-26T20:18:54+03:00
project_image: "images/open-web-text2.png"
---

## [Open Web Text 2](projects/open-web-text2/)
## [OpenWebText2](projects/open-web-text2/)

The core principle of WebText is to build a high-quality internet dataset by extracting URLs from Reddit submissions, scraping the URLs, and then performing filtering for quality (by upvotes) & deduplication. As the dataset collected for training the original GPT-2 is not public, researchers independently reproduced the pipeline and released the resulting dataset, called OpenWebTextCorpus (OWT).
OpenWebText2 is a dataset inspired by WebText, created by scraping URLs extracted from Reddit submissions up until April 2020 with a minimum score of 3 as a proxy for quality.

OpenWebText2 (OWT2) is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.
It features content from multiple languages, document metadata, multiple dataset versions, and open source replication code.
6 changes: 4 additions & 2 deletions content/projects-intros/the-pile.md
@@ -4,8 +4,10 @@ date: 2019-03-26T20:18:54+03:00
project_image: "images/the-pile.png"
---

## [The Pile](projects/pile/)
## [The Pile](https://pile.eleuther.ai/)

The Pile is a large, diverse, open source language modelling data set that consists of many smaller datasets combined together. The objective is to obtain text from as many modalities as possible to ensure that models trained using The Pile will have much broader generalization abilities.
The Pile is an **825 GiB** diverse, open source language modelling dataset consisting of data from 22 high quality sources. It is useful for both training and benchmarking large language models.

In our evaluations, models trained on the Pile show moderate improvements in traditional language modeling benchmarks, along with significant improvements on Pile BPB (our benchmarking measure).

The Pile is now complete! Check it out [here](https://pile.eleuther.ai/).
10 changes: 5 additions & 5 deletions content/projects/open-web-text2.md
@@ -1,19 +1,19 @@
---
title: "Open Web Text 2"
title: "OpenWebText2"
date: 2019-04-26T20:18:54+03:00
layout: page
---

## ## {class="content-block"}
- ![alt](../../images/open-web-text2.png)
- ## Open Web Text 2
WebText is an internet dataset created by extracting URLs from Reddit submissions and scraping the URLs. It was collected for training the original GPT-2 and never released to the public; researchers independently reproduced the pipeline and released the resulting dataset, called [OpenWebTextCorpus (OWT)](https://skylion007.github.io/OpenWebTextCorpus/).
- ## OpenWebText2
WebText is an internet dataset created by scraping URLs extracted from Reddit submissions with a minimum score of 3 as a proxy for quality. It was collected for training the original GPT-2 and never released to the public; however, researchers independently reproduced the pipeline and released the resulting dataset, called [OpenWebTextCorpus (OWT)](https://skylion007.github.io/OpenWebTextCorpus/).

OpenWebText2 (OWT2) is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.
OpenWebText2 is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.


## OpenWebText2 is now live! ## {class="text-announcement"}
[Download now](https://the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar), or you can [read the paper](https://openwebtext2.readthedocs.io/en/latest/#welcome)
[Download now](https://the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar), or you can [read the docs](https://openwebtext2.readthedocs.io)


## ## {class="content-block"}