17zuoye/etl_utils

ETL Utils

All APIs are listed in etl_utils/__init__.py.

Usage

Install it.

pip install etl_utils

Import it.

from etl_utils import * # adds only about 6 MB of memory.

Feature List

1. Terminal

1.1. process_notifier

from etl_utils import process_notifier
import time
for i1 in process_notifier(iteratable_object, msg=u"RANGE"):
    # process(i1)
    time.sleep(0.005)
# Example output is:
#   [pid 17510] RANGE processing 500 records 100% |###################################################################################################################| 166.61 items/s

Requirements for iteratable_object:

  1. It must be an iterable data structure, e.g. a generator, a list-like or dict-like object, any ORM query, or a file object.
  2. There should be a way to fetch its total count. This is optional for lazy generators.
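
As a rough illustration of the two requirements above, here is a minimal standard-library sketch of a progress wrapper. It is not the etl_utils implementation: `notify_progress_sketch` and its `out` parameter are hypothetical names, and the output format is simplified. It shows why the total count is optional: when `len()` is unavailable, only the running count is reported.

```python
import sys


def notify_progress_sketch(iterable, msg=u"", out=sys.stderr):
    """Yield items from `iterable` while writing a progress line to `out`.

    Hypothetical helper for illustration only; not part of etl_utils.
    """
    # The total is optional: lazy generators and files have no len().
    try:
        total = len(iterable)
    except TypeError:
        total = None

    for count, item in enumerate(iterable, 1):
        if total is not None:
            out.write("\r%s processing %d/%d records" % (msg, count, total))
        else:
            out.write("\r%s processing %d records" % (msg, count))
        yield item
    out.write("\n")
```

Wrapping a loop as `for item in notify_progress_sketch(records, msg=u"RANGE"): ...` leaves the loop body unchanged, which is the same calling style as process_notifier.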

1.2. uprint

Python 2's built-in print handles plain unicode strings, but it shows Chinese (and other non-ASCII) text as escape sequences when the text is nested inside a dict or list. Use uprint for those cases.

Remember to convert str values to unicode first, otherwise the output will be garbled.

Example:

>>> print({u"你好":u"世界"})
{u'\u4f60\u597d': u'\u4e16\u754c'}
>>> from etl_utils import uprint
>>> tmp = uprint({u"你好":u"世界"})
{u'你好': u'世界'}
>>>

2. Cache

2.1. cpickle_cache function

cpickle_cache(cache_file_path, generate_data_func)

Generates the cache data by calling generate_data_func if cache_file_path does not exist.
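
The behaviour described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the etl_utils source: `pickle_cache_sketch` is a hypothetical name, and the sketch assumes the function returns the data both when it is freshly generated and when it is loaded from the cache file.

```python
import os
import pickle


def pickle_cache_sketch(cache_file_path, generate_data_func):
    """Return cached data, generating and storing it on the first call.

    Hypothetical stand-in for illustration; not the etl_utils source.
    """
    # Cache hit: load the previously pickled data from disk.
    if os.path.exists(cache_file_path):
        with open(cache_file_path, "rb") as f:
            return pickle.load(f)

    # Cache miss: generate the data once and write it to disk.
    data = generate_data_func()
    with open(cache_file_path, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data
```

With this shape, an expensive `generate_data_func` runs only while the cache file is missing. Deleting the file forces regeneration.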

2.2. cached_property decorator

Turn a function into a property.

from etl_utils import cached_property

class Universe:
    @cached_property
    def answer(self):
        return 42

answer = Universe().answer # no ()
assert answer == 42

2.3. classproperty

Similar to cached_property, but it's a property on a class itself.
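
To make "a property on a class itself" concrete, here is a minimal descriptor sketch. `classproperty_sketch` is a hypothetical name and this is not the etl_utils implementation, which may behave differently (for example, it may cache the value).

```python
class classproperty_sketch(object):
    """Descriptor exposing a method as a read-only property of the class.

    Hypothetical illustration only; not the etl_utils source.
    """

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # `owner` is the class, whether access goes through the class
        # itself or through one of its instances.
        return self.func(owner)


class Universe(object):
    _answer = 42

    @classproperty_sketch
    def answer(cls):
        return cls._answer
```

Here `Universe.answer` can be read without creating an instance and without parentheses.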

3. Design Pattern

3.1. singleton

The singleton pattern restricts the instantiation of a class to one object. See Wikipedia for more information.

from etl_utils import singleton, cached_property

@singleton() # or @singleton(multi_init=True)
class MySingleton(object):

    @cached_property
    def heavy_cpu(self):
        # process ...
        return cached_data

    def another_function(self, params):
        return process(params)

o1 = MySingleton()
o2 = MySingleton()
assert o1 is o2 # both names refer to the same instance

Re-importing the module that defines MySingleton does not initialize the class a second time, so you can encapsulate a set of related functions and data in the MySingleton class.

This function is thread-safe, and is imported from https://pypi.python.org/pypi/pysingleton.
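
For readers unfamiliar with the pattern, a thread-safe singleton decorator can be sketched as follows. This is not the pysingleton code. `singleton_sketch` and `Config` are hypothetical names. Unlike `@singleton()` above, the sketch is applied without parentheses and has no `multi_init` option.

```python
import threading


def singleton_sketch(cls):
    """Return a factory that always hands back one shared instance of `cls`.

    Hypothetical illustration only; not the pysingleton implementation.
    """
    lock = threading.Lock()
    instances = {}

    def get_instance(*args, **kwargs):
        # Hold the lock so two threads cannot both construct the instance.
        with lock:
            if cls not in instances:
                instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance


@singleton_sketch
class Config(object):
    def __init__(self):
        self.values = {}
```

Every call to `Config()` returns the same object, so state stored on it is shared by all callers.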

4. Basic data structure utils

ListUtils.most_common_inspect(list1)
ListUtils.uniq_seqs(seqs, uniq_lambda=None)
StringUtils.merge(*strs)
StringUtils.calculate_text_similarity(text1, text2, inspect=False, similar_rate_baseline=0.0, skip_special_chars=False)
StringUtils.frequence_chars_info(str1, length_lambda=lambda len1 : len1)
DictUtils.nested_read(dict1, keys, default_val=None)
DictUtils.add_default_value(dict1, default_value=None)
UnicodeUtils.is_chinese(uchar)
UnicodeUtils.is_number(uchar)
UnicodeUtils.is_alphabet(uchar)
UnicodeUtils.is_other(uchar)
UnicodeUtils.B2Q(uchar)
UnicodeUtils.is_Q(uchar)
UnicodeUtils.Q2B(uchar)
UnicodeUtils.stringQ2B(ustring, convert_strs={})
UnicodeUtils.uniform(ustring)
UnicodeUtils.string2List(ustring)
UnicodeUtils.ljust(str1, width, fillchar=' ')
UnicodeUtils.rjust(str1, width, fillchar=' ')
UnicodeUtils.just_str(self)
UnicodeUtils.read(filename)
HashUtils.hashvalue_with_sorted(str1)
ItertoolsUtils.split_seqs_by_size(seqs1, size1, inspect=False)
JsonUtils.unicode_dump(item1)

The list above is generated by running ruby generate_api_doc.rb.

5. LazyData

Load data only when needed.

from etl_utils import ld
ld.en_us_dict
ld.two_length_words
ld.regular_words
ld.lemmatize(word1)
ld.tagged_words__dict
ld.jieba

from etl_utils import regexp
regexp.alphabet
regexp.word
regexp.upper
regexp.object_id
regexp.special_chars

6. Memory

# The `slots_with_pickle` decorator adds `__slots__` to a class, which can
# dramatically reduce the memory footprint and improve execution speed
# by eliminating the per-instance dictionary.
# Objects of the decorated class can still be pickled and unpickled.

from etl_utils import slots_with_pickle

@slots_with_pickle('attr_a', 'attr_b', 'attr_c')
class Slots(object):
    def __init__(self):
        # Assign through self, otherwise these are only local variables.
        self.attr_a = 'a'
        self.attr_b = 'b'
        self.attr_c = 'c'
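
The idea behind combining `__slots__` with pickling can be sketched by hand. This is an illustration under assumptions, not what `slots_with_pickle` generates: `SlotsSketch` is a hypothetical class. pickle calls `__getstate__` and `__setstate__` when they are defined, and the older pickle protocols (0 and 1) need such hooks for classes that declare `__slots__`.

```python
class SlotsSketch(object):
    # No per-instance __dict__ is created; only these attributes exist.
    __slots__ = ("attr_a", "attr_b", "attr_c")

    def __init__(self):
        self.attr_a = "a"
        self.attr_b = "b"
        self.attr_c = "c"

    def __getstate__(self):
        # Collect only the slots that have actually been assigned.
        return {name: getattr(self, name)
                for name in self.__slots__ if hasattr(self, name)}

    def __setstate__(self, state):
        # Restore each saved slot on the new, uninitialized object.
        for name, value in state.items():
            setattr(self, name, value)
```

Assigning an attribute that is not listed in `__slots__` raises AttributeError, which is the price paid for the smaller memory footprint.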

7. Other utils

calculate_entropy(feature_with_count_dict)

is_nltk_word(str1) # is valid English word
extract_words(sentence)

ItemIncrementIdDict # Assigns an auto-incrementing integer to each item, e.g. an object_id
ItemsGroupAndIndexes # group result

MarkObjectIds # mark processed objects group

# Call each lambda in `lambdas` in order and return the result of the first one that raises no exception.
set_default_value(lambdas, msg=u"")
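
The "first lambda that raises no exception" idea behind set_default_value can be sketched as below. `first_successful_sketch` is a hypothetical name, not the etl_utils source. The README does not say what happens when every lambda fails, so raising a ValueError that carries `msg` is an assumption of this sketch.

```python
def first_successful_sketch(lambdas, msg=u""):
    """Call each zero-argument callable in order and return the first
    result obtained without an exception.

    Hypothetical helper for illustration; not the etl_utils source.
    """
    failures = 0
    for func in lambdas:
        try:
            return func()
        except Exception:
            # This candidate failed; fall through to the next one.
            failures += 1
    # Assumption: when every callable fails, report it using `msg`.
    raise ValueError("%s (all %d callables failed)" % (msg, failures))
```

A typical use is trying several possible keys on a record, e.g. `first_successful_sketch([lambda: row["title"], lambda: row["name"]])`, where `row` is whatever dict is being processed.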

Run tests

pip install -r requirements.txt
pip install nose
nosetests

License

MIT. David Chen @ 17zuoye.