-
Berlin-Brandenburg Academy of Sciences (BBAW)
- Berlin
- adrien.barbaresi.eu
- @adbarbaresi
- in/adrienbarbaresi
-
trafilatura Public
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
-
courlan Public
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
-
htmldate Public
Fast and robust date extraction from web pages, with Python or on the command-line
-
German-NLP Public
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
-
simplemma Public
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
-
py3langid Public
Forked from saffsd/langid.py
Faster, modernized fork of the language identification tool langid.py
-
awesome-web-scraper Public
Forked from duyet/awesome-web-scraper
A collection of awesome web scrapers and crawlers
-
datatrove Public
Forked from huggingface/datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Python Apache License 2.0 Updated Mar 20, 2024
-
dwdsmor Public
Forked from zentrum-lexikographie/dwdsmor
SFST/SMOR/DWDS-based German Morphology
XSLT GNU Lesser General Public License v3.0 Updated Jul 7, 2023
-
wee-benchmarking-tool Public
Forked from Nootka-io/wee-benchmarking-tool
Python MIT License Updated Oct 27, 2022
-
coronakorpus Public
Material for building a German-language web corpus dedicated to COVID-19
-
jusText Public
Forked from miso-belica/jusText
Heuristic-based boilerplate removal tool
-
btw21 Public
Forked from jfilter/btw21
Visualization of the most frequent words in the German federal election in 2021
Jupyter Notebook MIT License Updated Sep 24, 2021
-
awesome-crawler Public
Forked from BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
-
python-readability Public
Forked from buriy/python-readability
Fast Python port of arc90's readability tool, updated to match the latest readability.js
HTML Updated Feb 19, 2020
-
jparser Public
Forked from fxsjy/jparser
A readability parser which can extract title, content, and images from HTML pages
Python MIT License Updated Feb 7, 2020
-
cChardet Public
Forked from PyYoshi/cChardet
Universal character encoding detector
Python Other Updated Nov 29, 2019
-
jlcl-style Public archive
Experiments to modernize the LaTeX class of the JLCL
-
archiveis Public
Forked from palewire/archiveis
A simple Python wrapper for the archive.is capturing service
Python MIT License Updated Jul 9, 2019
-
geokelone Public
Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation, and visualization
-
toponyms Public
Old prototype for toponym extraction in historical texts written in German
-
dateparser Public
Forked from scrapinghub/dateparser
Python parser for human-readable dates
Python BSD 3-Clause "New" or "Revised" License Updated Sep 11, 2017
-
vardial-experiments Public
Experiments conducted on the occasion of the VarDial shared tasks
-
valency-oriented-chunker Public
A one-pass FSA valency-oriented chunker for German (proof of concept)
Perl GNU Lesser General Public License v3.0 Updated Oct 14, 2016
-
flux-toolchain Public
Filtering and Language-identification for URL Crawling Seeds (FLUCS), a.k.a. FLUX-Toolchain