fastText

fastText is a library for efficient learning of word representations and sentence classification.

Requirements

fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support, such as:

(gcc-4.8 or newer) or (clang-3.3 or newer)

You will also need:

Python version 2.7 or >=3.4
NumPy & SciPy
pybind11

Building fastText

The easiest way to install fastText is to use pip.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

Alternatively, you can install fastText using setuptools.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ python setup.py install

Now you can import this library with

import fastText

Examples

This document assumes that the reader is already familiar with fastText. If you are not, consult the main README and, in particular, the tutorials on our website.

We recommend you look at the examples within the doc folder.

As with any package, you can get help on any Python function using the built-in help function.

For example

>>> import fastText
>>> help(fastText.FastText)

Help on module fastText.FastText in fastText:

NAME
    fastText.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the BSD-style license found in the
    # LICENSE file in the root directory of this source tree. An additional grant
    # of patent rights can be found in the PATENTS file in the same directory.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]

IMPORTANT: Preprocessing data / encoding conventions

In general, it is important to properly preprocess your data. In particular, our example scripts in the root folder do this.

fastText assumes UTF-8 encoded text. All text must be unicode for Python2 and str for Python3. The passed text will be encoded as UTF-8 by pybind11 before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.
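If iconv is not available, the same conversion can be done from Python. The following is a minimal sketch, assuming Python 3 and that you already know the source encoding; the function name `convert_to_utf8` and the file paths are hypothetical and not part of fastText.

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file as UTF-8, streaming it line by line.

    Hypothetical helper, not part of the fastText API.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

For example, `convert_to_utf8("corpus.latin1.txt", "corpus.utf8.txt", "latin-1")` would write a UTF-8 copy of a Latin-1 encoded corpus. A wrong `src_encoding` either raises `UnicodeDecodeError` or silently produces garbled text, so it is worth inspecting a few lines of the output.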

fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

space
tab
vertical tab
carriage return
formfeed
the null character
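One way to follow this advice is to normalize the text before passing it to fastText. The sketch below assumes Python 3 and uses `str.isspace()` as the definition of "Unicode whitespace"; the helper name `normalize_whitespace` is hypothetical, and mapping everything to a plain space is only one possible choice.

```python
# The separators listed above, plus newline, which delimits lines.
ASCII_SEPARATORS = " \t\v\r\f\0\n"


def normalize_whitespace(text):
    """Replace whitespace that fastText would not split on with a space.

    Hypothetical helper, not part of the fastText API. Any character that
    Python treats as whitespace (str.isspace) and that is not one of the
    separators above is mapped to an ASCII space.
    """
    return "".join(
        " " if ch.isspace() and ch not in ASCII_SEPARATORS else ch
        for ch in text
    )
```

With this helper, a no-break space (U+00A0) or an ideographic space (U+3000) between two words becomes an ordinary space, while tabs and newlines are left as they are. Characters such as U+2028 (line separator) are also mapped to a space here; map them to a newline instead if they should end a line in your data.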

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means that if you have text that is not separated by newlines, such as the fil9 dataset, it will be broken into chunks of MAX_LINE_SIZE tokens and the EOS token will not be appended.
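The following is a simplified pure-Python model of the behavior described above, not the library's C++ implementation. The function name `model_lines` is hypothetical, `max_line_size` stands in for MAX_LINE_SIZE, and the default `eos="</s>"` is an assumption; check the Dictionary header for the actual values. Boundary cases (for example, a newline arriving right after a full chunk) may differ from the real code.

```python
import re

# A token is either a newline or a run of characters that are neither a
# newline nor one of the ASCII separators listed earlier.
_TOKEN_RE = re.compile("\n|[^ \t\v\r\f\0\n]+")


def model_lines(text, max_line_size, eos="</s>"):
    """Group tokens into lines the way the paragraph above describes.

    Illustrative model only; names and defaults are assumptions.
    """
    lines, current = [], []
    for token in _TOKEN_RE.findall(text):
        if token == "\n":
            # A newline ends the line and appends the EOS token.
            current.append(eos)
            lines.append(current)
            current = []
            continue
        current.append(token)
        if len(current) >= max_line_size:
            # The chunk is full: cut it here, without an EOS token.
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines
```

Calling `model_lines("a b c d e", 2)` yields three chunks with no EOS token, which mirrors what happens to a corpus without newlines such as fil9, only with a much smaller limit.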

The length of a token is the number of UTF-8 characters it contains, not the number of bytes. Characters are counted by inspecting the leading two bits of each byte to identify the subsequent (continuation) bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.
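The counting rule can be sketched in a few lines of Python 3. In UTF-8, continuation bytes always have the bit pattern `10xxxxxx`, so counting every byte that does not match this pattern gives the number of characters. The helper name `utf8_char_count` is hypothetical and is not part of the fastText API.

```python
def utf8_char_count(token):
    """Count the characters in a token from its UTF-8 bytes.

    A byte whose two leading bits are 10 continues a multi-byte
    sequence, so only the remaining bytes start a new character.
    """
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)
```

For a pure ASCII token the result equals the byte length. For a token such as "café" (with a precomposed é), the UTF-8 encoding has five bytes but the function returns four, which is the length that matters when setting the subword length bounds.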
