Draft make_query Function for R-Style Solr Queries #58

Open
njlyon0 opened this issue Feb 23, 2024 · 3 comments

@njlyon0

njlyon0 commented Feb 23, 2024

Summary

Hey EDIutils team! I recently talked with Colin Smith and Greg Maurer about creating a make_query function to help people with some R literacy but limited prior exposure to Solr build Solr queries. The hope is that this new function would make it easier for R users to make good use of EDIutils::search_data_packages.

I've taken a stab at this function and will attach the full code to this issue. Note that I also wrote two helper functions, solr_wild and solrize, to keep the internals of make_query as streamlined as possible. I'm definitely a novice with Solr queries, so make_query may be missing crucial arguments. Still, I think it's a reasonable starting point: it's built to be semi-modular and could easily support additional arguments. All functions are written in base R (version 4.3.2).

Let me know if this doesn't work on your end and/or if you'd like me to make any changes before it could possibly be built into EDIutils. Thanks!

Function Demo Script

# Load needed libraries
library(EDIutils)

# Clear environment
rm(list = ls())

# Define helper function
## Swaps human equivalents of wildcards for Solr wildcard
solr_wild <- function(bit){
  
  # Handle empty `bit`
  if(is.null(bit) == TRUE){
    
    # Replace with wildcard
    bit_v2 <- "*"
  }
  
  # Handle English equivalents for wildcard
  else if(length(bit) == 1){
    
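    # Replace allowed keywords with wildcard (anchored so that words merely
    # containing "all" or "any", e.g. "rainfall", are left untouched)
    bit_v2 <- gsub(pattern = "^(all|any)$", replacement = "*", x = bit)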
  } 
  
  # If neither condition is met, return whatever was originally supplied
  else { bit_v2 <- bit }
  
  # Return finished product
  return(bit_v2) }

# Example(s)
solr_wild(bit = NULL)
solr_wild(bit = "any")
solr_wild(bit = "something else")

# Define helper function
## Parses English text into Solr syntax (i.e., right delimiters, etc.)
solrize <- function(bit){
  
  # Replace spaces with hyphens
  bit_v2 <- gsub(pattern = " ", replacement = "-", x = bit)
  
  # If more than one value, handle that
  if(length(bit_v2) > 1){
    
    # Collapse with plus signs
    bit_v3 <- paste0("(", paste0(bit_v2, collapse = "+"), ")")
    
  } else { bit_v3 <- bit_v2 }
  
  # Return finished bit
  return(bit_v3) }

# Example(s)
solrize(bit = c("primary production", "plants"))

# Define function to generate query
make_query <- function(keywords = NULL, subjects = NULL, authors = NULL, 
                       scopes = NULL, excl_scopes = NULL, 
                       return_fields = "all", limit = 10){

  ## Error Checking ----
  # Define supported return 'return_fields'
  good_fields <- c("*", "all", "abstract", "begindate", "doi", "enddate", "funding", "geographicdescription", "id", "methods", "packageid", "pubdate", "responsibleParties", "scope", "site", "taxonomic", "title", "authors", "spatialCoverage", "sources", "keywords", "organizations", "singledates", "timescales")
  
  # Error out for unsupported ones
  if(all(return_fields %in% good_fields) != TRUE)
    stop("Unrecognized return field(s): ", 
         paste(base::setdiff(x = return_fields, y = good_fields), collapse = "; "))
  
  # Fall back to the default (with a message) for non-numeric limit
  if(is.numeric(limit) != TRUE){
    message("`limit` must be numeric, coercing to 10")
    limit <- 10 }
  
  ## Solr Query Construction ----
  # Make start of query object
  query_v0 <- "q="
  
  # Handle keywords (a `NULL` value becomes a wildcard via `solr_wild`):
  ### 1. Turn into Solr Syntax
  solr_kw <- solrize(bit = solr_wild(bit = keywords)) 
  
  ### 2. Add to query
  query_v1 <- paste0(query_v0, "keyword:", solr_kw)
  
  # Handle authors
  solr_aut <- solrize(bit = solr_wild(bit = authors))
  query_v2 <- paste0(query_v1, "&fq=", "author:", solr_aut)
  
  # Handle subjects
  solr_sub <- solrize(bit = solr_wild(bit = subjects))
  query_v3 <- paste0(query_v2, "&fq=", "subject:", solr_sub)
  
  # Handle scopes
  solr_scp <- solrize(bit = solr_wild(bit = scopes))
  query_v4 <- paste0(query_v3, "&fq=", "scope:", solr_scp)
  
  # EXCLUDED scopes
  ## Handled differently because don't want to swap `NULL` for wildcard
  if(is.null(excl_scopes) != TRUE){
    
    # Solr-ize
    solr_excl_scp <- solrize(bit = excl_scopes)
    
    # Add to query
    query_v5 <- paste0(query_v4, "&fq=", "-scope:", solr_excl_scp)
    
    # Or skip
  } else { query_v5 <- query_v4 }
  
  # Parse return fields
  ## Solr syntax for multiple entries differs here from other elements of query
  solr_fl <- paste(solr_wild(bit = return_fields), collapse=",")
  query_v6 <- paste0(query_v5, "&fl=", solr_fl)
  
  # Finally, assemble full query with row limit
  solr_query <- paste0(query_v6, "&rows=", limit)
  
  # Return that to the user
  return(solr_query) }

# Invoke function
( request <- make_query(keywords = "*", 
                        scopes = "knb-lter-fce",
                        excl_scopes = c("ecotrends", "lter landsat"),
                        return_fields = c("title", "authors", "id", "doi"),
                        limit = 10) )

# Test assembled query
EDIutils::search_data_packages(query = request)

# Test use of `make_query` inside of `search_data_packages`
EDIutils::search_data_packages(query = make_query(excl_scopes = "knb-lter-fce",
                                                  return_fields = c("title", "id")))
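
One possible variation, offered only as a sketch and not as part of the draft above: in standard Solr, a filter such as `fq=author:*` matches only records that have some value in that field, so always adding wildcard filters for unused arguments could quietly drop records where the field is empty. Whether that matters for the EDI index is an assumption I have not checked. The function below (hypothetical name `make_query_sparse`, with a reduced set of arguments) skips any filter left as `NULL` instead of swapping in a wildcard.

```r
# Hypothetical variant of `make_query()`: filters left as `NULL` are skipped
# rather than replaced with a wildcard. Names here are illustrative only.
make_query_sparse <- function(keywords = NULL, authors = NULL, scopes = NULL,
                              return_fields = "*", limit = 10){

  # Format one or more values for a Solr field (spaces -> hyphens, as above)
  fmt <- function(bit){
    bit <- gsub(pattern = " ", replacement = "-", x = bit)
    if(length(bit) > 1){ paste0("(", paste0(bit, collapse = "+"), ")") } else { bit }
  }

  # Main query: fall back to a wildcard only when no keywords are given
  q <- paste0("q=keyword:", if(is.null(keywords)) "*" else fmt(keywords))

  # Keep only the filters that were actually supplied
  filters <- list(author = authors, scope = scopes)
  filters <- filters[!vapply(filters, is.null, logical(1))]

  # Turn each remaining filter into its own `fq=` parameter
  fq <- vapply(names(filters),
               function(f) paste0("fq=", f, ":", fmt(filters[[f]])),
               character(1))

  # Return fields and row limit
  fl <- paste0("fl=", paste(return_fields, collapse = ","))
  rows <- paste0("rows=", limit)

  # Join all pieces with ampersands
  paste(c(q, fq, fl, rows), collapse = "&")
}

# Only the scope filter appears because `authors` was left as `NULL`
make_query_sparse(scopes = "knb-lter-fce", return_fields = c("title", "id"))
# "q=keyword:*&fq=scope:knb-lter-fce&fl=title,id&rows=10"
```

The trade-off is that the query string changes shape depending on which arguments are supplied, which is a little harder to read than the fixed layout in the draft.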

@njlyon0
Author

njlyon0 commented Feb 23, 2024

Related Function

I just heard about the query function in the dataone package which seems like it could be a nice 'middle path' for constructing Solr queries (see here).

Users can create their own Solr queries (A) by hand/manually, (B) by supplying a named list that breaks queries into four chunks, or (C) by using something like the function I supplied above where each Solr parameter is mapped to a separate argument.
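
To make the difference between (A) and (B) concrete, here is a rough base R sketch. It does not call dataone::query; the helper `list_to_query` is hypothetical, and the list layout (`q`, `fq`, `fl`, `rows`) is my assumption about what the "four chunks" might look like.

```r
# Hypothetical helper: collapse a named list of Solr parameters into one string
list_to_query <- function(params){

  # Repeat each name once per value so that multiple `fq` entries are supported
  keys <- rep(names(params), times = lengths(params))

  # Flatten the values (numbers are coerced to character)
  vals <- unlist(params, use.names = FALSE)

  # Build `key=value` pairs and join them with ampersands
  paste0(keys, "=", vals, collapse = "&")
}

# (A) Query written by hand
by_hand <- "q=keyword:*&fq=scope:knb-lter-fce&fl=title,id&rows=10"

# (B) Same query expressed as a named list
as_list <- list(q = "keyword:*",
                fq = "scope:knb-lter-fce",
                fl = "title,id",
                rows = 10)

# Both routes should give the same string
identical(list_to_query(params = as_list), by_hand)
```

Option (C) sits one level above this: the user never sees the `q`/`fq`/`fl` names at all and supplies keywords, scopes, and so on as ordinary R arguments.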

I'm biased, but I think mapping each parameter to its own argument is novel enough (relative to dataone::query) that it still warrants inclusion as its own function. I just wanted to point out that a similar function already exists.

@clnsmth
Contributor

clnsmth commented Feb 23, 2024

This is great @njlyon0, thanks for the draft! I'll give it a test drive and return with some feedback.

@clnsmth
Contributor

clnsmth commented Feb 23, 2024

Related to #36

2 participants