Update README.md
neesjanvaneck committed Jan 22, 2024
1 parent a3feec6 commit d1d0410
Showing 1 changed file with 280 additions and 1 deletion.
281 changes: 280 additions & 1 deletion README.md
# publicationclassificationlabeling

This Java package can be used to obtain labels for clusters of scientific publications. These clusters can be created using the publicationclassification package.

Labels are obtained based on the titles of a sample of publications in each cluster. The package uses OpenAI GPT language models. It supports the [GPT-3.5 and Updated GPT-3.5 Turbo models](https://platform.openai.com/docs/models/gpt-3-5) as well as the [GPT-4 and GPT-4 Turbo models](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo).

The publicationclassificationlabeling package was developed by [Nees Jan van Eck](https://orcid.org/0000-0001-8448-4521) at the [Centre for Science and Technology Studies (CWTS)](https://www.cwts.nl) at [Leiden University](https://www.universiteitleiden.nl/en).

## Documentation

Documentation of the source code of publicationclassificationlabeling is provided in the code in `javadoc` format. The documentation is also available in a [compiled format](https://CWTSLeiden.github.io/publicationclassificationlabeling).

## Installation

### Maven

```
<dependency>
    <groupId>nl.cwts</groupId>
    <artifactId>publicationclassificationlabeling</artifactId>
    <version>1.0.0</version>
</dependency>
```

### Gradle

```
implementation group: 'nl.cwts', name: 'publicationclassificationlabeling', version: '1.0.0'
```

## Usage

The publicationclassificationlabeling package requires Java 8 or higher. The latest version of the package is available as a pre-compiled `jar` file on [Maven Central](https://central.sonatype.com/artifact/nl.cwts/publicationclassificationlabeling) and [GitHub Packages](https://github.com/CWTSLeiden/publicationclassificationlabeling/packages).
Instructions for compiling the source code of the package are provided [below](#development-and-deployment).

Use the command-line tool `PublicationClassificationLabelingCreator` to obtain cluster labels. The tool can be run as follows:

```
java -cp publicationclassificationlabeling-1.0.0.jar nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator
```

If no further arguments are provided, the following usage notice will be displayed:

```
PublicationClassificationLabelingCreator version 1.0.0
By Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
Usage: PublicationClassificationLabelingCreator
    <pub_titles_file> <label_file>
    <api_key> <gpt_model> <print_labeling>
  (to create a publication classification labeling based on data in text files)
or PublicationClassificationLabelingCreator
    <server> <database> <pub_titles_table> <label_table>
    <api_key> <gpt_model> <print_labeling>
  (to create a publication classification labeling based on data in an SQL Server database)
Arguments:
<pub_titles_file>
    Name of the publication titles input file. This text file must contain two tab-separated
    columns (without a header line): a column of cluster numbers and a column of publication
    titles. The cluster numbers in the first column must be integers starting at zero. The
    publication titles in the second column (e.g., the titles of a sample of 100 publications)
    must be concatenated into a single string. The lines in the file must be sorted by the
    cluster numbers in the first column.
<label_file>
    Name of the labels output file. This text file will contain six tab-separated columns
    (without a header line): a column of cluster numbers, a column of short labels, a column
    of long labels, a column of keywords, a column of descriptions, and a column of Wikipedia
    page links. Cluster numbers are integers starting at zero.
<server>
    SQL Server server name. A connection will be made using integrated authentication.
<database>
    Database name.
<pub_titles_table>
    Name of the publication titles input table. This table must have two columns: cluster_no
    and pub_titles. The cluster numbers in the first column must be integers starting at zero.
    The publication titles in the second column (e.g., the titles of a sample of 100
    publications) must be concatenated into a single string.
<label_table>
    Name of the labels output table. This table will have six columns: cluster_no,
    short_label, long_label, keywords, summary, and wikipedia_url. Cluster numbers are
    integers starting at zero.
<api_key>
    OpenAI API key.
<gpt_model>
    OpenAI GPT model. The models supported are: 'gpt-4-1106-preview', 'gpt-4',
    'gpt-3.5-turbo-1106', and 'gpt-3.5-turbo'.
<print_labeling>
    Boolean indicating whether the generated publication classification labeling should be
    printed to the standard output or not.
```
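
The input file format described in the usage notice (cluster number, a tab, and all sampled titles concatenated into one string, sorted by cluster number) could be produced along the following lines. This is only a sketch: the class `PubTitlesFileWriter` and its method `toLines` are hypothetical helpers and not part of the package, and joining titles with a single space and writing UTF-8 are assumptions that the usage notice does not confirm.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the publicationclassificationlabeling package.
public class PubTitlesFileWriter {

    // Builds one line per cluster: <cluster_no> TAB <titles concatenated into one string>.
    public static List<String> toLines(Map<Integer, List<String>> titlesByCluster) {
        // A TreeMap iterates in ascending key order, so lines come out sorted by cluster number.
        Map<Integer, List<String>> sorted = new TreeMap<>(titlesByCluster);
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : sorted.entrySet()) {
            List<String> cleaned = new ArrayList<>();
            for (String title : entry.getValue()) {
                // Tabs or line breaks inside a title would break the two-column layout.
                cleaned.add(title.replaceAll("[\\t\\r\\n]+", " ").trim());
            }
            // Assumption: titles are joined with a single space.
            lines.add(entry.getKey() + "\t" + String.join(" ", cleaned));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, List<String>> titlesByCluster = new TreeMap<>();
        titlesByCluster.put(0, Arrays.asList(
                "The link-prediction problem for social networks"));
        titlesByCluster.put(1, Arrays.asList(
                "The journal coverage of Web of Science and Scopus: a comparative analysis"));
        // Assumption: UTF-8 encoding; the usage notice does not specify one.
        Files.write(Paths.get("cluster_pub_titles.txt"), toLines(titlesByCluster),
                StandardCharsets.UTF_8);
    }
}
```

Sorting through a `TreeMap` keeps the "sorted by cluster number" requirement satisfied regardless of the order in which clusters are added.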

### Example

The following example illustrates the use of the `PublicationClassificationLabelingCreator` tool. Suppose you have a text file `cluster_pub_titles.txt`:

```
0 The link-prediction problem for social networks Twitter Power: Tweets as Electronic Word of Mout...
1 The journal coverage of Web of Science and Scopus: a comparative analysis What do citation count...
2 Social network analysis: a powerful strategy, also for the information sciences Google Scholar, ...
3 The sharing economy: Why people participate in collaborative consumption How open is innovation?...
4 Academic engagement and commercialisation: A review of the literature on university-industry rel...
5 Growth rates of modern science: A bibliometric analysis based on the number of publications and ...
6 The determinants of national innovative capacity Citations, family size, opposition and the valu...
7 Developing a framework for responsible innovation Technologies of humility: Citizen participatio...
8 Theory and practise of the g-index An approach for detecting, quantifying, and visualizing the e...
9 Technological transitions as evolutionary reconfiguration processes: a multi-level perspective a...
10 Software survey: VOSviewer, a computer program for bibliometric mapping CiteSpace II: Detecting ...
```
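
Before calling the tool, it may be worth checking that a file such as `cluster_pub_titles.txt` meets the requirements listed in the usage notice. The sketch below is a hypothetical checker (the class `PubTitlesFileChecker` is not part of the package). It checks only what the usage notice states: two tab-separated columns, integer cluster numbers starting at zero, and lines sorted by cluster number. Reading the file as UTF-8 is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the publicationclassificationlabeling package.
public class PubTitlesFileChecker {

    // Returns a list of human-readable problems; an empty list means no problem was found.
    public static List<String> check(List<String> lines) {
        List<String> problems = new ArrayList<>();
        boolean first = true;
        int previous = 0;
        for (int i = 0; i < lines.size(); i++) {
            int lineNo = i + 1;
            // The limit -1 keeps trailing empty columns so they are counted correctly.
            String[] columns = lines.get(i).split("\t", -1);
            if (columns.length != 2) {
                problems.add("Line " + lineNo + ": expected 2 tab-separated columns, found "
                        + columns.length);
                continue;
            }
            int cluster;
            try {
                cluster = Integer.parseInt(columns[0].trim());
            } catch (NumberFormatException e) {
                problems.add("Line " + lineNo + ": cluster number is not an integer");
                continue;
            }
            if (first && cluster != 0) {
                problems.add("Line " + lineNo + ": cluster numbers must start at zero");
            }
            if (!first && cluster < previous) {
                problems.add("Line " + lineNo + ": lines are not sorted by cluster number");
            }
            if (columns[1].trim().isEmpty()) {
                problems.add("Line " + lineNo + ": no publication titles");
            }
            first = false;
            previous = cluster;
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: PubTitlesFileChecker <pub_titles_file>");
            return;
        }
        // Assumption: the file is UTF-8 encoded.
        List<String> problems = check(Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        if (problems.isEmpty()) {
            System.out.println("No problems found.");
        }
        for (String problem : problems) {
            System.out.println(problem);
        }
    }
}
```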

The `PublicationClassificationLabelingCreator` tool can then be run as follows:

```
java -cp publicationclassificationlabeling-1.0.0.jar nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator cluster_pub_titles.txt label.txt <your OpenAI API key> gpt-3.5-turbo-1106 true
```
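
Typing the API key directly into the command leaves it in the shell history. One possible alternative, sketched below, is a small launcher that reads the key from an environment variable and starts the tool as a separate process. The class `LabelingLauncher` and the variable name `OPENAI_API_KEY` are assumptions made for this illustration; the tool itself only accepts the key as a command-line argument, so the key is still visible in the argument list of the started process.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher, not part of the publicationclassificationlabeling package.
public class LabelingLauncher {

    // Fully qualified main class as stated in the "Development and deployment" section.
    static final String MAIN_CLASS =
            "nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator";

    // Assembles the same command line as in the example above, in the text-file mode.
    public static List<String> buildCommand(String jar, String pubTitlesFile, String labelFile,
                                            String apiKey, String gptModel, boolean printLabeling) {
        return Arrays.asList("java", "-cp", jar, MAIN_CLASS,
                pubTitlesFile, labelFile, apiKey, gptModel, String.valueOf(printLabeling));
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumption: the key is stored in an environment variable named OPENAI_API_KEY.
        String apiKey = System.getenv("OPENAI_API_KEY");
        if (apiKey == null || apiKey.isEmpty()) {
            System.err.println("Environment variable OPENAI_API_KEY is not set.");
            return;
        }
        List<String> command = buildCommand("publicationclassificationlabeling-1.0.0.jar",
                "cluster_pub_titles.txt", "label.txt", apiKey, "gpt-3.5-turbo-1106", true);
        // inheritIO() forwards the tool's console output to this process.
        int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
        System.out.println("Tool finished with exit code " + exitCode);
    }
}
```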

The cluster labels obtained using the tool can be found in the text file `label.txt`:

```
0 Information Retrieval Information Retrieval and Knowledge Management ...
1 Bibliometric Analysis Bibliometric Analysis and Research Evaluation ...
2 Scientific Collaboration Patterns and Impact of Scientific Collaboration ...
3 Open Innovation Open Innovation and Collaborative Knowledge Sharing ...
4 University-Industry Relations University-Industry Relations and Technology Transfer ...
5 Scholarly Communication Scholarly Communication in the Digital Age ...
6 Innovation Studies Determinants of National Innovative Capacity and Patent Analysis ...
7 Research Impact Assessment Assessing the Societal Impact of Research ...
8 Bibliometric Analysis Bibliometric Analysis in Scholarly Communication ...
9 Technological Transitions Technological Transitions as Evolutionary Reconfiguration Processes ...
10 Bibliometric Mapping Bibliometric Mapping and Interdisciplinary Research Analysis ...
```
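
The six tab-separated columns of `label.txt` can be read back into objects for further processing. The following is a minimal sketch with hypothetical names (`LabelFileReader`, `ClusterLabel`). It assumes that the keywords column uses the same semicolon-separated form as the console output and that the file is UTF-8 encoded; neither is stated in the usage notice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the publicationclassificationlabeling package.
public class LabelFileReader {

    // One row of the labels output file.
    public static final class ClusterLabel {
        public final int clusterNo;
        public final String shortLabel;
        public final String longLabel;
        public final List<String> keywords;
        public final String summary;
        public final String wikipediaUrl;

        ClusterLabel(int clusterNo, String shortLabel, String longLabel,
                     List<String> keywords, String summary, String wikipediaUrl) {
            this.clusterNo = clusterNo;
            this.shortLabel = shortLabel;
            this.longLabel = longLabel;
            this.keywords = keywords;
            this.summary = summary;
            this.wikipediaUrl = wikipediaUrl;
        }
    }

    // Parses a single line of the labels file into a ClusterLabel.
    public static ClusterLabel parseLine(String line) {
        // The limit -1 keeps empty trailing columns, so the column count stays reliable.
        String[] columns = line.split("\t", -1);
        if (columns.length != 6) {
            throw new IllegalArgumentException(
                    "Expected 6 tab-separated columns, found " + columns.length);
        }
        List<String> keywords = new ArrayList<>();
        // Assumption: keywords are separated by semicolons, as in the console output.
        for (String keyword : columns[3].split(";")) {
            if (!keyword.trim().isEmpty()) {
                keywords.add(keyword.trim());
            }
        }
        return new ClusterLabel(Integer.parseInt(columns[0].trim()), columns[1], columns[2],
                keywords, columns[4], columns[5]);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: LabelFileReader <label_file>");
            return;
        }
        // Assumption: the file is UTF-8 encoded.
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            ClusterLabel label = parseLine(line);
            System.out.println(label.clusterNo + ": " + label.shortLabel
                    + " (" + label.keywords.size() + " keywords)");
        }
    }
}
```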

The tool displays the following output:

```
PublicationClassificationLabelingCreator version 1.0.0
By Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
Reading publication titles from file... Finished!
Reading publication titles from file took 0h 0m 0s.
Creating labeling for each cluster...
Creating labeling cluster 0... Finished!
Labeling:
Short label: Information Retrieval
Long label: Information Retrieval and Knowledge Management
Keywords: Information Retrieval; Knowledge Management; Social Networks; Sentiment Analysis; User Engagement; Web Searching; Information Literacy; Data Mining; Online Communities; Credibility Assessment
Summary: This cluster of papers focuses on information retrieval, knowledge management, and related topics such as social networks, sentiment analysis, user engagement, web searching, information literacy, data mining, online communities, and credibility assessment.
Wikipedia: https://en.wikipedia.org/wiki/Information_retrieval
Creating labeling cluster 1... Finished!
Labeling:
Short label: Bibliometric Analysis
Long label: Bibliometric Analysis and Research Evaluation
Keywords: Bibliometric Analysis; Research Evaluation; Citation Impact Indicators; Journal Rankings; University Research Funding; Scientific Performance Measurement; Peer Review Bias; Interdisciplinary Research; Altmetrics; Publication Delay
Summary: This cluster of papers focuses on bibliometric analysis, research evaluation, and the use of citation impact indicators in assessing scientific performance. It covers topics such as journal rankings, university research funding systems, peer review bias, interdisciplinary research, altmetrics, and publication delay. The papers also delve into the challenges and implications of using various metrics to measure research productivity and impact.
Wikipedia: https://en.wikipedia.org/wiki/Bibliometrics
Creating labeling cluster 2... Finished!
Labeling:
Short label: Scientific Collaboration
Long label: Patterns and Impact of Scientific Collaboration
Keywords: Scientific Collaboration; Bibliometrics; Research Impact; International Collaboration; Co-authorship Networks; Research Productivity; Knowledge Production; Citation Analysis; Gender Differences; Technology Innovation
Summary: This cluster of papers focuses on the patterns and impact of scientific collaboration, bibliometrics, research productivity, and knowledge production. It explores topics such as international collaboration, co-authorship networks, research impact, gender differences in research productivity, and technology innovation. The papers analyze the relationship between innovation and subjective wellbeing, the growth of indexed journals in Latin America and the Caribbean, and the feasibility of text mining techniques to detect similarity between patent documents and scientific publications.
Wikipedia: https://en.wikipedia.org/wiki/Scientific_collaboration
Creating labeling cluster 3... Finished!
Labeling:
Short label: Open Innovation
Long label: Open Innovation and Collaborative Knowledge Sharing
Keywords: Open Innovation; Collaborative Consumption; Knowledge Sharing; R&D Cooperation; Innovation Performance; Environmental Innovation; SMEs; Crowdsourcing; Absorptive Capacity; User Innovations
Summary: This cluster of papers explores the concept of open innovation, collaborative consumption, and knowledge sharing in the context of R&D cooperation, innovation performance, environmental innovation, and SMEs. It delves into the dynamics of crowdsourcing, absorptive capacity, and user innovations, emphasizing the importance of collaborative networks for driving innovation.
Wikipedia: https://en.wikipedia.org/wiki/Open_innovation
Creating labeling cluster 4... Finished!
Labeling:
Short label: University-Industry Relations
Long label: University-Industry Relations and Technology Transfer
Keywords: University-Industry Relations; Technology Transfer; Entrepreneurial University; Innovation; Academic Entrepreneurship; Incubator; Spin-off Companies; Knowledge Transfer; Venture Capital; Science Parks
Summary: This cluster of papers explores the dynamics of university-industry relations, technology transfer, and the entrepreneurial activities of academic institutions. It delves into topics such as the impact of organizational practices on technology transfer, factors influencing university-industry collaboration, the role of academic entrepreneurship, and the effectiveness of incubators in fostering innovation and new venture creation.
Wikipedia: https://en.wikipedia.org/wiki/University-industry_collaboration
Creating labeling cluster 5... Finished!
Labeling:
Short label: Scholarly Communication
Long label: Scholarly Communication in the Digital Age
Keywords: Altmetrics; Open Access; Social Media; Bibliometrics; Research Impact; Scientific Collaboration; Academic Networking; Citation Analysis; Webometrics; Research Data Management
Summary: This cluster of papers explores the impact of digital technologies on scholarly communication, including the use of altmetrics, open access publishing, social media, and research data management. It also delves into topics such as citation analysis, scientific collaboration, academic networking, and webometrics.
Wikipedia: https://en.wikipedia.org/wiki/Scholarly_communication
Creating labeling cluster 6... Finished!
Labeling:
Short label: Innovation Studies
Long label: Determinants of National Innovative Capacity and Patent Analysis
Keywords: Innovation; Patent; Technology; Knowledge Flow; National Innovation System; Entrepreneurship; R&D Spillovers; Intellectual Property Rights; Science-Technology Linkage; Innovation Policy
Summary: This cluster of papers explores the determinants of national innovative capacity, patent analysis, technology as a complex adaptive system, knowledge flow, entrepreneurship, R&D spillovers, and the impact of intellectual property rights on innovation. It delves into the interplay between science and technology, innovation policy, and the role of national innovation systems in economic development.
Wikipedia: https://en.wikipedia.org/wiki/Innovation
Creating labeling cluster 7... Finished!
Labeling:
Short label: Research Impact Assessment
Long label: Assessing the Societal Impact of Research
Keywords: Research Impact Assessment; Scientific Collaboration; Innovation Policy; Interdisciplinary Research; Knowledge Transfer; Academic Entrepreneurship; Science Policy Interfaces; University-Industry Collaboration; Bibliometric Analysis; Societal Relevance
Summary: This cluster of papers focuses on assessing the societal impact of research, including topics such as research impact assessment, scientific collaboration, innovation policy, interdisciplinary research, knowledge transfer, academic entrepreneurship, science policy interfaces, university-industry collaboration, and bibliometric analysis. The papers explore the influence of funding agencies, international collaboration, gender differences in research collaboration, and the public understanding of science. They also discuss the challenges and opportunities in evaluating the effectiveness of science-policy interfaces and highlight the importance of societal relevance in research.
Wikipedia: https://en.wikipedia.org/wiki/Research_impact_assessment
Creating labeling cluster 8... Finished!
Labeling:
Short label: Bibliometric Analysis
Long label: Bibliometric Analysis in Scholarly Communication
Keywords: h-index; citation analysis; bibliometric indicators; research impact; co-authorship networks; Google Scholar; Scopus; scientific evaluation; publication output; academic collaboration
Summary: This cluster of papers focuses on the analysis of bibliometric indicators, such as the h-index, citation counts, and co-authorship networks, to evaluate research impact and scholarly communication. It compares data sources like Google Scholar and Scopus, explores the influence of self-citation, and discusses the challenges and benefits of using various metrics for scientific evaluation.
Wikipedia: https://en.wikipedia.org/wiki/Bibliometrics
Creating labeling cluster 9... Finished!
Labeling:
Short label: Technological Transitions
Long label: Technological Transitions as Evolutionary Reconfiguration Processes
Keywords: Sustainability Transitions; Innovation Systems; Multi-level Perspective; Intermediaries; Knowledge Diffusion; Policy Mixes; Business Models; Regional Innovation Systems; Socio-technical Regimes; Demand-side Policies
Summary: This cluster of papers explores technological transitions as evolutionary reconfiguration processes, focusing on sustainability transitions, innovation systems, multi-level perspective, intermediaries, knowledge diffusion, policy mixes, business models, regional innovation systems, socio-technical regimes, and demand-side policies.
Wikipedia: https://en.wikipedia.org/wiki/Technological_transition
Creating labeling cluster 10... Finished!
Labeling:
Short label: Bibliometric Mapping
Long label: Bibliometric Mapping and Interdisciplinary Research Analysis
Keywords: Bibliometric Mapping; Interdisciplinary Research; Scientific Literature; Citation Analysis; Co-citation Networks; Knowledge Structure; Science Mapping Software; Research Fronts; Author Cocitation Analysis; Topic Modeling
Summary: This cluster of papers focuses on the analysis and visualization of scientific literature through bibliometric mapping, citation analysis, and co-citation networks. It explores interdisciplinary research, knowledge structure, and the use of various software tools for science mapping. The papers also delve into author cocitation analysis, research fronts, and topic modeling to understand the evolution and connections within different research fields.
Wikipedia: https://en.wikipedia.org/wiki/Bibliometrics
Creating labeling for each cluster took 0h 0m 37s.
Writing labeling to file... Finished!
Writing labeling to file took 0h 0m 0s.
```

## License

The publicationclassificationlabeling package is distributed under the [MIT license](LICENSE).

## Issues

If you encounter any issues, please report them using the [issue tracker](https://github.com/CWTSLeiden/publicationclassificationlabeling/issues) on GitHub.

## Contribution

You are welcome to contribute to the development of the publicationclassificationlabeling package. Please follow the typical GitHub workflow: fork this repository and open a pull request to submit your changes.
Make sure that your pull request has a clear description and that the code has been properly tested.

## Development and deployment

The latest stable version of the source code is available in the [`main`](https://github.com/CWTSLeiden/publicationclassificationlabeling/tree/main) branch on GitHub. The most recent version of the source code, which may be under development, is available in the [`develop`](https://github.com/CWTSLeiden/publicationclassificationlabeling/tree/develop) branch.

### Compilation

To compile the source code of the publicationclassificationlabeling package, a [Java Development Kit](https://jdk.java.net) (version 8 or higher) needs to be installed on your system. Installing [Gradle](https://www.gradle.org) is optional because the [Gradle Wrapper](https://docs.gradle.org/current/userguide/gradle_wrapper.html) is included in this repository.

On Windows systems, the source code can be compiled as follows:

```
gradlew build
```

On Linux and macOS systems, use the following command:

```
./gradlew build
```

The compiled `class` files can be found in the directory `build/classes`.
The compiled `jar` file can be found in the directory `build/libs`.
The compiled `javadoc` files can be found in the directory `build/docs`.

The class `nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator` has a `main` method. After compiling the source code, the `PublicationClassificationLabelingCreator` tool can be run as follows:

```
java -cp build/libs/publicationclassificationlabeling-<version>.jar nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator
```
