Skip to content

Record visits to a site so that they're easy to analyse afterward

Notifications You must be signed in to change notification settings

learnable/visit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

visit

Store some subset (or all) of an app's http requests in a database.

Get the data out using Active Record Query Interface.

Install the gem into your app

# add to Gemfile
bundle
rails generate visit:install
rake db:migrate

Customise

To customise, create a config/initializers/visit.rb, eg:

Visit::Configurable.configure do |c|

  c.bulk_insert_batch_size = 100 # cache requests in a SerializedQueue (see below)

  # This method is called when requests are on the :enroute SerializedQueue
  # Your options are:
  # - don't override this method in your app, Visit::Factory.new.run will insert these
  #   requests (in the Rails request cycle)
  # - override this method in your app and delegate Visit::Factory.new.run to a worker
  # - override this method in your app, make it do nothing, because you have workers
  #   that pop directly from the :enroute queue
  #
  c.bulk_insert_now = ->() do
    Visit::Factory.new.run
  end

  c.cookies_match = [
    /^flip_/, # save cookies set via the flip gem
  ]

  c.current_user_id = -> (controller) do
    controller.instance_eval { current_user ? current_user.id : nil }
  end

  c.ignorable = [
      /^\/api/, # don't store requests to /api
    ]

  # Some slow-running parts of the gem are instrumented.
  # To get a sense of it, bundle exec rails console:
  # > puts Visit::Log.last.to_instrumenter_history.to_s
  #
  c.instrumenter_toggle = ->(category) do
    true # category == :deduper || category == :factory
  end

  c.token_cookie_mutator = :visit_tag_controller # or :application_controller

  c.labels_match_first = [
      [ :get, %r{^/contact}, :contact_prompt ]
    ]

  # urls containing ?invite=blah generate a trait: { :invite => :blah }
  #
  c.labels_match_all = c.labels_match_all.push *[
    [ :get, %r{[?&]invite=(\w+)}, :invite ]
  ]

  # If you set bulk_insert_batch_size > 1, you need a persistent SerializedQueue:
  # - in your app, add 'redis' to your Gemfile
  # - in your app, configure redis in config/initializers/redis.rb: $redis = Redis.connect(url: Settings.redis.url)
  #
  require 'redis'
  c.serialized_queue = ->(key) { Visit::SerializedQueue::Redis.new($redis, key) }

  # our app uses Airbrake for exception handling
  #
  c.notify = ->(e) { Airbrake.notify e } unless Rails.env.development?

  # lighten the load on the db (far fewer SELECTs)
  #
  c.cache = Visit::Cache::Dalli.new \
    ActiveSupport::Cache.lookup_store \
      :dalli_store,
      "127.0.0.1:11211",
      { :namespace => "#{Rails.application.class.parent_name}::visit", :expires_in => 28.days }
    #.silence! - stops cache logging to development.log
end

Sample implementation of VisitFactoryWorker:

class VisitFactoryWorker
  include Sidekiq::Worker

  def perform
    Visit::Factory.new.run
  end
end

Label and captures

Visit::Configurable.labels allows the app to associate labels (and regexp captures) with URL paths.

Which in turn supports queries like this:

Visit::Query::LabelledEvent.new.scoped.
  where("label_vtv.v = 'dashboard'").
  where(created_at: (1.day.ago..Time.now)).
  count

How the gem hooks into the Rails request cycle

In brief:

  • a controller filter builds a request_payload_hash (containing everything interesting about a web request), and pushes it onto the :filling SerializedQueue
  • when the :filling SerializedQueue is full? it is moved into an :enroute SerializedQueue and Configurable.bulk_insert_now is called
  • the request_payload_hashes are removed from the :enroute SerializedQueue and inserted into the database.

My app is part Rails and part non-Rails

Non Rails apps can push a request_payload_hash directly onto the :filling queue.

To figure out:

  • the format of the hash, see: rails_request_context.rb, and
  • the redis key, run from the Rails console, Visit::SerializedQueue::Redis.new($redis, :filling).send(:key)

Deduper

The gem supports eventual consistency of SourceValues and TraitValues for reasons of:

  • performance (bulk insert of n requests is many times faster than n inserts),
  • scalability (multiple workers can by bulk inserting at the same time), and
  • mysql indexes can only cover the first 255 chars of a VARCHAR column (ignoring innodb_large_prefix), so the 'v' columns must have non-unique indexes.

When consistent, each row in tables visit_source_values and visit_trait_values have a unique value of 'v'.

To create consistency, your app should periodically run Visit::Deduper.new.run (eg. daily) to eliminate duplicate values of 'v' and fix any references to those duplicates.

Here's what a sidekiq worker looks like:

require "visit"

class VisitDeduperWorker < BaseWorker
  def perform
    begin
      Visit::Deduper.new.run
    rescue
      Airbrake.notify $!
    end
  end
end

MySQL users: if you are happy to increase innodb_large_prefix, you can then increase the index :length limits in the CreateVisitSourceValues and CreateVisitTraitValues migrations. It might give you a little more lookup performance - when there are strings that are the same in the first 255 chars and different after that.

Destroying unused rows

There are a number of ways you can be storing data you don't need:

  • you don't set Configurable.ignorable,
  • you narrow the set of cookies you're interested in (Configurable.cookies_match)

If you then want to save space in your database:

bundle exec rails console
> Visit::DestroyUnused.new(dry_run: true).sources! { |sources| puts sources.map { |source| [source.key.v, source.value.v] } }
> Visit::DestroyUnused.new(dry_run: true).events! { |events| puts events.map { |event| event.url } }
# oh, I want to keep a url that's ignored, because I created it via `create_visit_event`
> Visit::DestroyUnused.new(dry_run: true, keep_urls: [ %r{/api} ]).events! { |events| puts events.map { |event| event.url } }
> Visit::DestroyUnused.new(dry_run: true).source_values! { |source_values| puts source_values.map { |sv| sv.v } }
# ok, looks good, I'm now going to irrevocably delete!
> Visit::DestroyUnused.new(keep_urls: [ %r{/api} ]).irrevocable!

Flush Configurable.cache

bundle exec rails console
> Visit::Configurable.cache.has_key? Visit::Cache::Key.new("visit::traitvalue.find_by_v.id", "label")
true
> Visit::Configurable.cache.clear
[true]
> Visit::Configurable.cache.has_key? Visit::Cache::Key.new("visit::traitvalue.find_by_v.id", "label")
false

Inspecting the queues

bundle exec rails console
> sqm=Visit::SerializedQueue::Manager.new
=> #<Visit::SerializedQueue::Manager:0x000000072f9a48>
> sqm.queue_lengths
=> [{:filling=>1, :enroute=>1}, {:enroute=>[{"/POFK9EXX2QcThIW"=>10}]}]
> sqm.transfer_to_enroute
=> "uUt/oW3nThtGfBH3"
> sqm.queue_lengths
=> [{:filling=>0, :enroute=>2}, {:enroute=>[{"/POFK9EXX2QcThIW"=>10}, {"uUt/oW3nThtGfBH3"=>1}]}]
> Visit::Factory.new.run
=> ...
> sqm.queue_lengths
=> [{:filling=>0, :enroute=>1}, {:enroute=>[{"uUt/oW3nThtGfBH3"=>1}]}]
> Visit::Factory.new.run
=> ...
> sqm.queue_lengths
=> [{:filling=>0, :enroute=>0}]

Configure the gem to not use the default database

Visit::Configurable.configure do |c|

  c.db_connect = "visit_database_for_#{Rails.env}"

end

And in your database.yml:

visit_database_for_development:
  database: visit_development

visit_database_for_test:
  database: visit_test

visit_database_for_production:
  database: visit_production

And of course you need to create those databases, set permissions, apply schemas etc.

Developing the gem

git clone [email protected]:learnable/visit.git

mysql

$ mysql -u root

CREATE DATABASE visit;
CREATE DATABASE visit_test;
GRANT usage on *.* TO visit@localhost IDENTIFIED BY 'visit';
GRANT ALL PRIVILEGES on visit.* to visit@localhost;
GRANT ALL PRIVILEGES on visit_test.* to visit@localhost;

postgres

Via psql

CREATE USER visit CREATEDB;

redis

You'll need a redis server running.

Memcache

You'll need a memcache server running on port 11211.

Then

bundle
cd spec/dummy
rm db/migrate/*_visit_* # only necessary if migrations change, but can't hurt
bundle exec rake db:create
rails g visit:migration
bundle exec rake db:migrate
bundle exec rake db:migrate RAILS_ENV=test

visit_event_views

For debugging or ad-hoc sql queries it's sometimes nice to have a denormalised view of the data that the gem is storing.

This sql query creates a database view for that purpose.

CREATE VIEW visit_event_views AS
SELECT
  DISTINCT visit_events.id as id,
  visit_events.http_method_enum as http_method_enum,
  url_vsv.v as url,
  user_id,
  token,
  label_vtv.v as label,
  capture1_vtv.v as capture1,
  capture2_vtv.v as capture2,
  user_agent_vsv.v as user_agent,
  visit_events.created_at as created_at
FROM visit_events

INNER JOIN visit_source_values url_vsv
  ON visit_events.url_id = url_vsv.id

INNER JOIN visit_source_values user_agent_vsv
  ON visit_events.user_agent_id = user_agent_vsv.id

LEFT OUTER JOIN visit_traits label_vt
  ON visit_events.id = label_vt.visit_event_id AND label_vt.k_id = (select id from visit_trait_values where v = 'label')
LEFT OUTER JOIN visit_trait_values label_vtv
  ON label_vtv.id = label_vt.v_id

LEFT OUTER JOIN visit_traits capture1_vt
  ON visit_events.id = capture1_vt.visit_event_id AND capture1_vt.k_id = (select id from visit_trait_values where v = 'capture1')
LEFT OUTER JOIN visit_trait_values capture1_vtv
  ON capture1_vtv.id = capture1_vt.v_id

LEFT OUTER JOIN visit_traits capture2_vt
  ON visit_events.id = capture2_vt.visit_event_id AND capture2_vt.k_id = (select id from visit_trait_values where v = 'capture2')
LEFT OUTER JOIN visit_trait_values capture2_vtv
  ON capture2_vtv.id = capture2_vt.v_id

ORDER BY visit_events.id ASC

TODO

MAJOR

MODERATE

  • archiving - zip up everying over n months old and send to S3?

About

Record visits to a site so that they're easy to analyse afterward

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages