This repository has been archived by the owner on Jun 9, 2024. It is now read-only.

IORoot/scraper__instamancer

Instamancer+

Scrape Instagram's API with Puppeteer. Now with login detection.

Instamancer is a new type of scraping tool that leverages Puppeteer's ability to intercept requests made by a webpage to an API.

Read more about how Instamancer works here.
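
As a rough illustration of the interception idea described above, the following is a minimal sketch rather than Instamancer's actual implementation. `ResponseLike`, `isApiResponse`, `makeCollector`, and the `/graphql/query` URL filter are hypothetical names and values chosen for the example; only `page.on("response", ...)`, `response.url()`, and `response.json()` are real Puppeteer calls, and they appear in the trailing comment.

```typescript
// Minimal shape of the parts of a Puppeteer HTTPResponse used by this sketch.
interface ResponseLike {
    url(): string;
    json(): Promise<unknown>;
}

// ASSUMPTION: the "/graphql/query" path is an illustrative filter, not a
// confirmed detail of how Instamancer recognises API traffic.
const isApiResponse = (url: string): boolean => url.includes("/graphql/query");

// Builds a listener that stores the parsed JSON body of every matching response.
function makeCollector(sink: unknown[]) {
    return async (response: ResponseLike): Promise<void> => {
        if (!isApiResponse(response.url())) {
            return;
        }
        try {
            sink.push(await response.json());
        } catch {
            // The body was not valid JSON or was unavailable; skip it.
        }
    };
}

// With Puppeteer, the listener would be attached before navigating, e.g.:
//   const payloads: unknown[] = [];
//   const page = await browser.newPage();
//   page.on("response", makeCollector(payloads));
//   await page.goto("https://www.instagram.com/explore/tags/beach/");
```

The listener only observes traffic that the page itself generates, so the scraper does not need to construct API requests on its own.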

Features+

  • Login detection
  • Single browser instance option
  • Scrape multiple users

Features

  • Scrape hashtags, users' posts, and individual posts
  • Download images, albums, and videos
  • Output JSON, CSV
  • Batch scraping
  • Search hashtags, users, and locations
  • API response validation
  • Upload files to S3 and depot
  • Plugins

Data

Metadata that Instamancer is able to gather from posts:

  • Text
  • Timestamps
  • Tagged users
  • Accessibility captions
  • Like counts
  • Comment counts
  • Images (Thumbnails, Dimensions, URLs)
  • Videos (URL, View count, Duration)
  • Comments (Timestamp, Text, Like count, User)
  • User (Username, Full name, Profile picture, Profile privacy)
  • Location (Name, Street, Zip code, City, Region, Country)
  • Sponsored status
  • Gating information
  • Fact checking information

Install

Linux

Enable user namespace cloning:

sysctl -w kernel.unprivileged_userns_clone=1

Or run without a sandbox:

# WARNING: unsafe
export NO_SANDBOX=true

See Puppeteer troubleshooting

Without downloading Chromium

If you wish to install Instamancer without downloading Chromium, set the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD environment variable before installation:

export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

From NPM

npm install -g instamancer

If you're using root to install globally, use the following command so that the Puppeteer dependency is installed as well:

sudo npm install -g instamancer --unsafe-perm=true

From NPX

npx instamancer

From this repository

git clone https://github.com/ScriptSmith/instamancer.git
cd instamancer
npm install
npm run build
npm install -g

/src/api/creds.json

If a login screen is detected, Instamancer uses the credentials in the creds.json file to log in as that user and then carries on scraping. Copy creds_demo.json to creds.json and fill in real account details.

Usage

Command Line

$ instamancer
Usage: instamancer <command> [options]

Commands:
  instamancer hashtag [id]       Scrape a hashtag
  instamancer user [id]          Scrape a users posts
  instamancer users [ids]        Scrape a comma-separated list of users posts
  instamancer post [ids]         Scrape a comma-separated list of posts
  instamancer search [query]     Perform a search of users, tags and places
  instamancer batch [batchfile]  Read newline-separated arguments from a file

Configuration
  --count, -c    Number of posts to download (0 for all)   [number] [default: 0]
  --full, -f     Retrieve full post data              [boolean] [default: false]
  --sleep, -s    Seconds to sleep between interactions     [number] [default: 2]
  --graft, -g    Enable grafting                       [boolean] [default: true]
  --browser, -b  Browser path. Defaults to the puppeteer version        [string]
  --sameBrowser  Use a single browser when grafting   [boolean] [default: false]

Download
  --download, -d      Save images from posts          [boolean] [default: false]
  --downdir           Download path       [default: "downloads/[endpoint]/[id]"]
  --video, -v         Download videos (requires full) [boolean] [default: false]
  --sync              Force download between requests [boolean] [default: false]
  --threads, -k       Parallel download / depot threads    [number] [default: 4]
  --waitDownload, -w  Download media after scraping   [boolean] [default: false]

Upload
  --bucket  Upload files to an AWS S3 bucket                            [string]
  --depot   Upload files to a URL with a PUT request (depot)            [string]

Output
  --file, -o       Output filename. '-' for stdout    [string] [default: "[id]"]
  --type, -t       Filetype   [choices: "csv", "json", "both"] [default: "json"]
  --mediaPath, -m  Add filepaths to _mediaPath        [boolean] [default: false]

Display
  --visible    Show browser on the screen             [boolean] [default: false]
  --quiet, -q  Disable progress output                [boolean] [default: false]

Logging
  --logging, -l    [choices: "none", "error", "info", "debug"] [default: "none"]
  --logfile      Log file name             [string] [default: "instamancer.log"]

Validation
  --strict  Throw an error on response type mismatch  [boolean] [default: false]

Plugins
  --plugin, -p  Use a plugin from the plugins directory    [array] [default: []]

Options:
  --help     Show help                                                 [boolean]
  --version  Show version number                                       [boolean]

Examples:
  instamancer hashtag instagood -fvd        Download all the available posts,
                                            and their media from #instagood
  instamancer user arianagrande --type=csv  Download Ariana Grande's posts to a
  --logging=info --visible                  CSV file with a non-headless
                                            browser, and log all events
  instamancer users arianagrande,therock    Download Ariana Grande's and the
  -c 3                                      Rock's latest three posts.
Source code available at https://github.com/IORoot/instamancer
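
The `batch` command reads newline-separated arguments from a file. The sketch below shows one plausible way to read such a file, assuming each non-empty line holds the arguments of one `instamancer` invocation. The `parseBatch` helper and the sample contents are hypothetical and are not taken from Instamancer's source.

```typescript
// Hypothetical batch file contents: one set of instamancer arguments per line.
const batchFile = `
hashtag instagood -c 10
user arianagrande --type=csv
users arianagrande,therock -c 3
`;

// Splits the file into lines, drops blank ones, and tokenises each line on
// whitespace. ASSUMPTION: arguments contain no quoted strings with spaces.
function parseBatch(contents: string): string[][] {
    return contents
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
        .map((line) => line.split(/\s+/));
}

const jobs = parseBatch(batchFile);
console.log(jobs);
```

Each inner array corresponds to the argument list of one scraping job, in the order the lines appear in the file.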

Module

ES2018 TypeScript example:

import {createApi, IOptions} from "instamancer";

const options: IOptions = {
    total: 10
};
const hashtag = createApi("hashtag", "beach", options);

(async () => {
    for await (const post of hashtag.generator()) {
        console.log(post);
    }
})();
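
Because the example above consumes `generator()` with `for await`, posts can be treated like any other async iterable. The helpers below are a generic sketch for capping and collecting results on the caller's side. `take`, `collect`, and `fakePosts` are hypothetical names that are not part of Instamancer, and `fakePosts` stands in for `hashtag.generator()` so the code runs without a browser.

```typescript
// Yields at most `limit` items from any async iterable, then stops consuming it.
async function* take<T>(source: AsyncIterable<T>, limit: number): AsyncGenerator<T> {
    if (limit <= 0) {
        return;
    }
    let seen = 0;
    for await (const item of source) {
        yield item;
        seen += 1;
        if (seen >= limit) {
            // Returning here also closes the underlying iterator.
            return;
        }
    }
}

// Drains an async iterable into an array.
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
    const items: T[] = [];
    for await (const item of source) {
        items.push(item);
    }
    return items;
}

// Stand-in for `hashtag.generator()`: an endless stream of dummy posts.
async function* fakePosts(): AsyncGenerator<{ id: number }> {
    for (let id = 1; ; id++) {
        yield { id };
    }
}

(async () => {
    const firstFive = await collect(take(fakePosts(), 5));
    console.log(firstFive.length);
})();
```

With the real module, `take(hashtag.generator(), 5)` would be used in place of the stand-in, in addition to (or instead of) the `total` option.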

Generator functions

import {createApi} from "instamancer"

createApi("hashtag", id, options);
createApi("user", id, options);
createApi("users", ids, options);
createApi("post", ids, options);
createApi("search", query, options);

Options

const options: Instamancer.IOptions = {
    // Total posts to download. 0 for unlimited
    total: number,

    // Run Chrome in headless mode
    headless: boolean,

    // Logging events
    logger: winston.Logger,

    // Run without output to stdout
    silent: boolean,

    // Time to sleep between interactions with the page
    sleepTime: number,

    // Throw an error if type validation fails
    strict: boolean,

    // Time to sleep when rate-limited
    hibernationTime: number,

    // Enable the grafting process
    enableGrafting: boolean,

    // Extract the full amount of information from the API
    fullAPI: boolean,

    // Use a proxy in Chrome to connect to Instagram
    proxyURL: string,

    // Location of the chromium / chrome binary executable
    executablePath: string,

    // Custom io-ts validator
    validator: Type<unknown>,

    // Custom plugins
    plugins: IPlugin[]
}
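
Several of these options appear to mirror the command-line flags shown earlier. The sketch below pairs them up, assuming the correspondence suggested by their descriptions (for example `--count` with `total`, and `--visible` with the inverse of `headless`). That pairing is an inference, not something the source confirms, and `OptionsSubset`, `CliFlags`, `cliDefaults`, and `toOptions` are hypothetical names defined locally so the code compiles without the package.

```typescript
// Local copy of a subset of the option fields listed above.
interface OptionsSubset {
    total: number;
    headless: boolean;
    silent: boolean;
    sleepTime: number;
    strict: boolean;
    enableGrafting: boolean;
    fullAPI: boolean;
    executablePath?: string;
}

// Flags as named in the command-line help.
interface CliFlags {
    count: number;
    full: boolean;
    sleep: number;
    graft: boolean;
    visible: boolean;
    quiet: boolean;
    strict: boolean;
    browser?: string;
}

// Defaults copied from the command-line help text.
const cliDefaults: CliFlags = {
    count: 0,
    full: false,
    sleep: 2,
    graft: true,
    visible: false,
    quiet: false,
    strict: false,
};

// ASSUMPTION: the flag-to-field pairing is inferred from the descriptions only.
function toOptions(flags: Partial<CliFlags> = {}): OptionsSubset {
    const f = { ...cliDefaults, ...flags };
    return {
        total: f.count,
        headless: !f.visible, // --visible shows the browser, so headless is its inverse
        silent: f.quiet,
        sleepTime: f.sleep,
        strict: f.strict,
        enableGrafting: f.graft,
        fullAPI: f.full,
        executablePath: f.browser,
    };
}

console.log(toOptions({ count: 10, visible: true }));
```

Under the same assumption, the resulting object could be passed as the `options` argument of `createApi`.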

Comparison

A comparison of Instagram scraping tools. Please suggest more tools and criteria through a pull request.

To see a speed comparison, visit this page

Tool | Hashtags | Users | Tagged posts | Locations | Posts | Stories | Login not required | Private feeds | Batch mode | Plugins | Command-line | Library/Module | Download media | Download metadata | Scraping method | Daily builds | Main language | Speed | License | Last commit | Open Issues | Closed Issues | Build status | Test coverage | Code quality
Instamancer ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API request interception ✔️ Typescript
Instaphyte ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation ✔️ Python
Instaloader ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation Python
Instalooter ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation Python
Instagram crawler ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web DOM reading Python
Instagram Scraper ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation Python
Instagram Private API ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ App and Web API simulation Python
Instagram PHP Scraper ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation PHP

Languages

  • TypeScript 85.4%
  • JavaScript 14.6%