Skip to content

Command line Python scripting with an xargs-like interface and AWK-like capabilities for data processing and task automation

License

Notifications You must be signed in to change notification settings

elesiuta/pyxargs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pyxargs

This started as a simple solution to the encoding problem with xargs. It is a partial and opinionated implementation of xargs with the goal of being easier to use for most use cases.

It also contains additional features for AWK-like data processing, such as taking python code as arguments to be executed, or filtering with regular expressions. Some of these features take inspiration from pyp, Pyed Piper, and Pyped. A great comparison of them is provided by pyp, which mainly differs from pyxargs in that pyxargs has more of an xargs-like interface and built in file tree traversal (replacing the need for find), but lacks the AST introspection and manipulation of pyp which infers more from the command without passing flags.

You can install pyxargs from PyPI. Optionally depends on duckdb and pandas. Supports tab completion with argcomplete.

Command Line Interface

usage: pyxr [options] command [initial-arguments ...]
       pyxr -h | --help | --version

Build and execute command lines, python code, or mix from standard input or
file paths. The file input mode (default if stdin is not connected) builds
commands using filenames only and executes them in their respective
directories, this is useful when dealing with file paths containing multiple
character encodings. When executing python code, the following variables are
provided: i=index, j=remaining, n=total, x=input, s=split, d=dir,
a=all_inputs, out=previous_results, df=dataframe, js=json, db=duckdb

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -m input-mode, --mode input-mode
                        options are:
                        (f)ile    = build commands from filenames and execute
                                    in each subdirectory respectively
                        (p)ath    = build commands from file paths relative
                                    to the current directory and execute in
                                    the current directory
                        (a)bspath = build commands from absolute file paths
                                    and execute in the current directory
                        (s)tdin   = build commands from standard input and
                                    execute in the current directory
                        default: stdin if connected, otherwise file
  --folders             use folders instead files (for input modes: file,
                        path, abspath)
  -t, --top             do not recurse into subdirectories (for input modes:
                        file, path, abspath)
  --sym, --symlinks     follow symlinks when scanning directories (for input
                        modes: file, path, abspath)
  -a file, --arg-file file
                        read input items from file instead of standard input
                        (for input mode: stdin)
  -0, --null            input items are separated by a null character instead
                        of whitespace (for input mode: stdin)
  -l, --lines           input items are separated by a newline character
                        instead of whitespace (for input mode: stdin)
  -d delim, --delimiter delim
                        input items are separated by the specified delimiter
                        instead of whitespace (for input mode: stdin)
  -s regex, --split regex
                        split each input item with re.split(regex, input)
                        before building command (after separating by
                        delimiter), use {0}, {1}, ... to specify placement
                        (implies --format), it is also stored as a list in the
                        variable s
  -g regex, --groups regex
                        use regex capturing groups on each input item with
                        re.search(regex, input).groups() before building
                        command (after separating by delimiter), use {0}, {1},
                        ... to specify placement (implies --format), it is
                        also stored as a tuple in the variable s
  --format              format command with input using str.format() instead
                        of appending or replacing via -I replace-str, use {0},
                        {1}, ... to specify placement, if the command is then
                        evaluated as an f-string (--fstring) escape using
                        double curly braces as {{expr}} to evaluate
                        expressions
  -I replace-str        replace occurrences of replace-str in command with
                        input, default: {}
  --resub pattern substitution replace-str
                        replace occurrences of replace-str in command with
                        re.sub(patten, substitution, input)
  -r regex, --filter regex
                        only build commands from inputs matching regex for
                        input mode stdin, and matching relative paths for all
                        other input modes, uses re.search
  -o, --omit            omit inputs matching regex instead
  -b, --basename        only match regex against basename of input (for input
                        modes: file, path, abspath)
  -f, --fstring         evaluates commands as python f-strings before
                        execution
  --df                  reads each input into a dataframe and stores it in
                        variable df, requires pandas
  --js                  reads each input as a json object and stores it in
                        variable js
  --max-chars n         omits any command line exceeding n characters, no
                        limit by default
  --sh, --shell         executes commands through the shell (subprocess
                        shell=True) (warning, shlex.quote is not guaranteed to
                        be correct on Windows)
  -x, --pyex            executes commands as python code using exec()
  -e, --pyev            evaluates commands as python expressions using eval()
                        then prints the result
  -p, --pypr            evaluates commands as python f-strings then prints
                        them (implies --fstring)
  -q, --sql             reads each input into variable db then runs commands
                        as SQL queries using duckdb.sql(), requires duckdb
  --import library      executes 'import <library>' for each library
  --im library, --importstar library
                        executes 'from <library> import *' for each library
  --pre "code"          runs exec(code) before execution
  --post "code"         runs exec(code) after execution
  -P P, --procs P       split into P chunks and execute each chunk in parallel
                        as a separate process and window with byobu or tmux
  -c c, --chunk c       runs chunk c of P (0 <= c < P) (without multiplexer)
  --no-mux              do not use a multiplexer for multiple processes
  -i, --interactive     prompt the user before executing each command, only
                        proceeds if response starts with 'y' or 'Y'
  -n, --dry-run         prints commands without executing them
  -v, --verbose         prints commands before executing them

Examples

# by default, pyxargs will use filenames and run commands in each directory
  > pyxr echo

# instead of appending inputs, you can specify a location with {}
  > pyxr echo spam {} spam

# and like xargs, you can also specify the replace-str with -I
  > pyxr -I eggs echo spam eggs spam literal {}

# if stdin is connected, it will be used instead of filenames by default
  > echo bacon eggs | pyxr echo spam

# python code can be used in place of a command
  > pyxr --pyex "print(f'input file: {} executed in: {os.getcwd()}')"

# a shorter version of this command with --pypr (-p) and the magic variable d
  > pyxr -p "input file: {} executed in: {d}"

# python f-strings can also be used to format regular commands
  > pyxr -f echo "input file: {x} executed in: {d}"

# python code can also run before or after all the commands
  > pyxr --pre "n=0" --post "print(n,'files')" -x "n+=1"

# you can also evaluate and print python f-strings, the index i is provided
  > pyxr --pypr "number: {i}\tname: {}"

# provided variables:
# i = index, j = remaining, n = total, x = input, d = dir
# a = a list of all inputs, so a[i]=x
# out = a list of previous outputs, so out[i]=output (for -e, -p, -q)
# s = a list of columns if each x is a row, by default s=x.split()
# if the input mode is path or abspath, s=x.split(os.path.sep)
# if the input mode is file, s=os.path.splitext(x)
# if -s or -g is specified, then it is re.split() or re.search().groups()
# other variables are provided with flags: --df, --js, --sql
  > pyxr -p "i={i}\tj={j}\tn={n}\tx={x}\td={d}\ta[{i}]={a[i]}={a[-j]}\ts={s}"
  > pyxr -p "prev: {'START' if i<1 else a[i-1]}\t" \
               "current: {a[i]}\tnext: {'END' if j<1 else a[i+1]}"

# given variables are only in the global scope, so they won't overwrite locals
  > pyxr --pre "i=1;j=2;n=5;x=3;a=3;" -p "i={i} j={j} n={n} x={x} a={l}"

# you can also use dataframes as df with --df (requires pandas)
  > echo A,B,C\n1,2,3\n4,5,6 | pyxr -0 --df -p "{df}"

# or query sql databases as db with --sql (-q) (requires duckdb)
  > echo A,B,C\n1,2,3\n4,5,6 | pyxr -0 -q "SELECT * FROM db"
  > echo '{"a": 1,"b": 2}' | pyxr -0 -q "SELECT * FROM db"

# it can also read from databases, csv files, etc. (see duckdb extensions)
  > pyxr -t -r .sqlite -q "SELECT * FROM <table>"
  > pyxr -t -r .sqlite -f -q "SELECT * FROM {db[0]}"
  > pyxr -t -r .csv -q "SELECT * FROM db"
  > pyxr -t -q "SELECT * FROM db"
  > pyxr -t -q "SELECT * FROM '{}'"

# regular expressions can be used to filter and modify inputs
  > pyxr -r \.py --resub \.py .txt {new} echo {} -\> {new}

# you can test your command first with --dry-run (-n) or --interactive (-i)
  > pyxr -i echo filename: {}

# pyxargs can also run interactively in parallel by using byobu or tmux
  > pyxr -P 4 -i echo filename: {}

# you can use pyxargs to create a JSON mapping of /etc/hosts
  > cat /etc/hosts | pyxr -d \n --im json --pre "d={}" \
    --post "print(dumps(d))" -x "d['{}'.split()[0]] = '{}'.split()[1]"

# you can also do this with format strings and --split (-s) (uses regex)
  > cat /etc/hosts | pyxr -d \n -s "\s+" --im json --pre "d={}" \
    --post "print(dumps(d))" -x "d['{0}'] = '{1}'"

# use double curly braces to escape for f-strings since str.format() is first
  > cat /etc/hosts | pyxr -d \n -s "\s+" -p "{{i}}:{{'{1}'.upper()}}"

# this and the following examples will compare usage with find & xargs
  > find ./ -name "*" -type f -print0 | xargs -0 -I {} echo {}
  > find ./ -name "*" -type f -print0 | pyxr -0 -I {} echo {}

# pyxargs does not require '-I' to specify a replace-str (default: {})
  > find ./ -name "*" -type f -print0 | pyxr -0 echo {}

# and in the absence of a replace-str, exactly one input is appended
  > find ./ -name "*" -type f -print0 | pyxr -0 echo
  > find ./ -name "*" -type f -print0 | xargs -0 --max-args=1 echo
  > find ./ -name "*" -type f -print0 | xargs -0 --max-lines=1 echo

# pyxargs can use file paths as input without piping from another program
  > pyxr -m path echo ./{}

# and now for something completely different, python code for the command
  > pyxr -m path -x "print('./{}')"

About

Command line Python scripting with an xargs-like interface and AWK-like capabilities for data processing and task automation

Topics

Resources

License

Stars

Watchers

Forks

Languages