Time Series Forecasting with LLMs:
Understanding and Enhancing Model Capabilities

Hua Tang2,, Chong Zhang 3,*, Mingyu Jin1,*,  Qinkai Yu3,  Zhenting Wang1,
Xiaobo Jin5, Yongfeng Zhang 1, Mengnan Du 4
1Rutgers University  2Shanghai Jiao Tong University
3University of Liverpool 4New Jersey Institute of Technology
5Xi’an Jiaotong-Liverpool University
Equal Contribution.
Abstract

Large language models (LLMs) have been applied in many fields and have developed rapidly in recent years. As a classic machine learning task, time series forecasting has recently been boosted by LLMs. Recent works treat large language models as zero-shot time series reasoners without further fine-tuning, which achieves remarkable performance. However, there are some unexplored research problems when applying LLMs for time series forecasting under the zero-shot setting. For instance, the LLMs’ preferences for the input time series are less understood. In this paper, by comparing LLMs with traditional time series forecasting models, we observe many interesting properties of LLMs in the context of time series forecasting. First, our study shows that LLMs perform well in predicting time series with clear patterns and trends but face challenges with datasets lacking periodicity. This observation can be explained by the ability of LLMs to recognize the underlying period within datasets, which is supported by our experiments. In addition, the input strategy is investigated and it is found that incorporating external knowledge and adopting natural language paraphrases substantially improve the predictive performance of LLMs for time series. Overall, our study contributes insight into LLMs’ advantages and limitations in time series forecasting under different conditions.

1 Introduction

Recently, large language models (LLMs) have been widely used and have achieved promising performance across various domains, such as health management, customer analysis, and text feature mining [16, 15, 12]. Time series forecasting requires extrapolation from sequential observations. Language models are designed to discern intricate concepts within temporally correlated sequences, and intuitively appear well-suited for this task. Hence, there exist some preliminary studies that apply LLMs to time series forecasting tasks [10, 17, 19].

However, currently the application of LLMs for time series forecasting is still in its early stage, and the boundaries of this research area are not yet well defined. There are many unexplored problems in this field. For example, existing research lacks exploration into how the performance of LLMs varies when faced with different types of time series inputs. This includes the effectiveness gap for LLMs in predicting data with seasonal and trending patterns versus data without such patterns.

To fill this research gap, we focus on LLMs’ preferences for the input time series in time series forecasting under the zero-shot prompting setting. Through experiments on both real and synthesized datasets, we find that LLMs perform better on time series with higher trend or seasonal strengths. Our observations also reveal that LLMs perform worse when there are multiple periods within a dataset, which may be because LLMs cannot capture the distinct periods within such data. To further discern the LLMs’ preferences for specific segments of the input data, we design counterfactual experiments involving systematic perturbations of the input sequences. The findings suggest that LLMs are particularly sensitive to the segment of the input sequence closest to the target output.

Based on the above findings, we further explore why LLMs forecast well on datasets with higher seasonal strengths. To this end, we ask LLMs to report the period of each dataset over multiple runs. We find that LLMs can recognize the underlying period of a dataset in most cases. This helps explain why LLMs forecast time series with high trend or seasonal strengths well: they can capture the seasonal pattern within the data.

In light of the above-mentioned findings, we are interested in how to leverage these insights to further improve model performance. To address this, we propose two simple techniques to enhance model performance: incorporating external human knowledge and converting numerical sequences into natural language counterparts. Incorporating supplementary information enables large language models to more effectively grasp the periodic nature of time series data, moving beyond a mere emphasis on the tail of the time series. Transforming numerical data into a natural language format enhances the model’s ability to comprehend and reason, also serving as a beneficial approach. Both approaches improve model performance and contribute to our understanding of LLMs in time series forecasting. The workflow is illustrated in Figure 1.

The key contributions are as follows:

  • We investigate the preferences for the input sequences in LLMs in time series forecasting tasks. Our analysis has revealed that LLMs significantly outperform traditional time series forecasting methods without the need for additional fine-tuning. Interestingly, LLMs display superior predictive capabilities when dealing with datasets that have higher trends and seasonal strengths.

  • We require LLMs to identify the periodicity of datasets across multiple iterations. Our observations indicate that LLMs can effectively recognize the inherent periodic patterns within datasets. This observation answers the question of why LLMs perform well in forecasting time series with higher seasonal strengths, as they can capture the seasonal patterns inherent in the data.

  • We propose two simple techniques to improve model performance and find that both incorporating external human knowledge into input prompts and paraphrasing input sequences to natural language substantially improve the performance of LLMs in time series forecasting.

Figure 1: The workflow of our analysis. We process sequence data using different tokenization and embedding methods with various LLMs, such as GPTs and Gemini. To analyze the preferences of LLMs, we compute the seasonal and trend strengths of the datasets. Our experiments show that LLMs prefer series with higher seasonal and trend strengths. To explain these findings, we ask the LLMs to identify the underlying periods, and find that the models recognize them in most cases. In addition, to improve forecasting performance, we propose two approaches to the user input: for the input prompt, we incorporate human knowledge regarding the dataset sources; and for the input sequence, we reprogram the data into natural language sequences. Both methods result in substantially improved model performance.

2 Preliminaries

2.1 Large Language Model

We use LLMs as zero-shot learners for time series forecasting by treating numerical values as text sequences. In this paper, we investigate three closed-source LLMs, namely GPT-3.5-turbo, GPT-4-turbo, and Gemini-1.0-Pro, and one open-source LLM, llama-2-13B. The success of LLMs in time series forecasting can depend significantly on correct pre-processing and handling of the data [10]. We follow the pre-processing approach of Gruver et al. [10], which involves the following steps.

Input Pre-processing.  In this phase, we perform three pre-processing steps. First, numerical values are transformed into strings, a crucial step that significantly influences how the model comprehends and processes the data. For instance, with two digits of precision, the series 0.123, 1.23, 12.3, 123.0 is reformatted to "1 2, 1 2 3, 1 2 3 0, 1 2 3 0 0": spaces are introduced between digits, commas delineate time steps, and decimal points are omitted to save tokens. Second, tokenization is equally important, as it shapes the model’s pattern recognition capabilities. Tokenizers based on byte-pair encoding (BPE) [13] can split a number into arbitrary multi-digit chunks and disrupt numerical coherence; separating digits with spaces ensures that each digit is tokenized individually, which makes patterns easier to discern. Third, rescaling is employed to use tokens efficiently and manage large inputs by adjusting values so that a specific percentile aligns to 1. This exposes the model to varying digit counts and supports the generation of larger values.
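The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation of [10]: the function names, the nearest-rank percentile rule, the default percentile of 0.9, and the use of rounding (rather than truncation) to two digits are all assumptions made for this example.

```python
import math


def rescale(values, percentile=0.9):
    """Divide by the magnitude found at `percentile` (nearest-rank rule, an assumption)."""
    ordered = sorted(abs(v) for v in values)
    idx = min(len(ordered) - 1, max(0, math.ceil(percentile * len(ordered)) - 1))
    scale = ordered[idx] or 1.0  # fall back to 1.0 if the chosen magnitude is zero
    return [v / scale for v in values], scale


def serialize(values, precision=2):
    """Fixed precision, decimal point dropped, digits separated by spaces."""
    steps = []
    for v in values:
        digits = str(round(abs(v) * 10 ** precision))
        sign = "-" if v < 0 else ""
        steps.append(sign + " ".join(digits))
    return ", ".join(steps)  # commas delineate time steps


def deserialize(text, precision=2):
    """Invert `serialize`: strip the spaces and restore the decimal point."""
    return [int(step.replace(" ", "")) / 10 ** precision
            for step in text.split(",") if step.strip()]


print(serialize([0.123, 1.23, 12.3, 123.0]))
```

In this sketch, a model's textual output would be passed through `deserialize` and multiplied by the stored `scale` to return to the original units.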

2.2 Time Series Forecasting

In time series forecasting, the primary goal is to predict the values for the next $H$ steps based on observed values from the preceding $K$ steps, which is mathematically expressed as:

$$\hat{X}_{t},\ldots,\hat{X}_{t+H-1}=F(X_{t-1},\ldots,X_{t-K};V;\lambda) \tag{1}$$

Here, $\hat{X}_{t},\ldots,\hat{X}_{t+H-1}$ represent the $H$-step estimation given the previous $K$-step values $X_{t-1},\ldots,X_{t-K}$. $\lambda$ denotes the trained parameters of the model $F$, and $V$ denotes the prompt or any other information used for inference. In this paper, we focus predominantly on univariate time series forecasting and investigate the preferences and performance of LLMs under the zero-shot setting.

Motivated by interpretability requirements in real-world scenarios, time series can often be decomposed into a trend component, a seasonal component, and a residual component through the additive model [5]. The trend component captures the hidden long-term changes in the data, such as a linear or exponential pattern. The seasonal component captures the repeating variation in the data, and the residual component captures the variation that remains after removing the trend and seasonal components. This decomposition offers a method to quantify the properties of time series, which is detailed in subsection 3.2.

Datasets. In this study, we primarily use Darts [11], a benchmark univariate dataset widely recognized in deep learning research, along with many baseline methods. Darts consists of eight real univariate time series datasets, including those with clear patterns, such as the AirPassengersDataset, and irregular datasets, such as the SunspotsDataset. In addition, we employ other commonly used datasets, such as the US Births dataset [9], the TSMC-Stock and Turkeypower datasets [10], and ETT [22], in Sections 5.1 and 5.2 to demonstrate the effectiveness of our proposed methods. A full description of these datasets is given in Appendix A.1.

Evaluation Metrics. In this paper, we evaluate model performance with three metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). These metrics are defined as follows:

$$\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2} \tag{2}$$

$$\text{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right| \tag{3}$$

$$\text{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right| \tag{4}$$

where $y_{i}$ denotes the true value, $\hat{y}_{i}$ represents the predicted value, and $n$ is the sample size.
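The three metrics translate directly into code. The sketch below follows Equations (2) to (4) as written, so `mape` returns a fraction (multiply by 100 for a percentage); the function names and the toy values are chosen for this example only.

```python
def mse(y_true, y_pred):
    """Mean squared error, Equation (2)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error, Equation (3)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, Equation (4); undefined if any true value is 0."""
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only (not taken from any dataset in the paper).
truth = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
print(mse(truth, forecast), mae(truth, forecast), mape(truth, forecast))
```

Because MAPE divides by the true value, it is scale-free, which makes it convenient for comparing datasets with very different magnitudes, but it becomes unstable when true values are close to zero.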

3 What are LLMs’ Preferences in Time Series Forecasting?

To explore the preferences of LLMs, we first quantify the properties of the input time series. Then, to reinforce our findings, we evaluate the importance of different segments of the input sequence by adding Gaussian noise to the original time series.

3.1 Analyzing Method

We first compare the performance of LLMs and traditional time series forecasting methods, as shown in Table 9 and Table 10. LLMs perform better on most datasets. GPT-4-turbo and Llama-2 perform relatively well on the AirPassengersDataset and the AusBeerDataset, with low MAPE. Gemini outperforms GPT-3.5-turbo, and outperforms GPT-4-turbo on some datasets, but is on par with GPT-4-turbo overall.

To understand the preferences of the LLMs, we compare our framework using various foundational models, such as GPT-4-turbo and GPT-3.5-turbo, with traditional methods. We also design experiments on synthesized datasets to validate our findings and analyze the impact of the multiple periods. To quantify the LLMs’ preferences towards time series, following [20], we define the strength of the trend and the seasonality as follows:

$$Q_{T}=1-\frac{\text{Var}(X_{R})}{\text{Var}(X_{T}+X_{R})},\qquad Q_{S}=1-\frac{\text{Var}(X_{R})}{\text{Var}(X_{S}+X_{R})} \tag{5}$$

where $X_{T}\in\mathbb{R}^{K}$, $X_{S}\in\mathbb{R}^{K}$ and $X_{R}\in\mathbb{R}^{K}$ denote the trend component, the seasonal component and the residual component, respectively. These indices measure the strength of the trend and of the seasonality, and take values up to 1. A higher value indicates a stronger trend or seasonality within the time series. Throughout this paper, we use the phrase "higher strength" to compare strengths between different datasets. The assessment of strength is not based on a fixed level, as the concepts of "strong" and "weak" vary across different datasets and scenarios.
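Given the three components of a decomposition, Equation (5) is a one-line computation. The sketch below assumes the trend, seasonal, and residual components have already been produced by some decomposition routine (for example an STL implementation) and are passed in as equal-length lists; the function name and the toy components are illustrative only.

```python
from statistics import pvariance


def strength(component, residual):
    """Q = 1 - Var(X_R) / Var(X_C + X_R), with X_C the trend or the seasonal component.

    Raises ZeroDivisionError if component + residual is constant.
    """
    combined = [c + r for c, r in zip(component, residual)]
    return 1.0 - pvariance(residual) / pvariance(combined)


# Toy components for illustration: a clean linear trend plus small alternating residuals.
trend = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
residual = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
q_t = strength(trend, residual)                # close to 1: strong trend
q_none = strength([0.0] * 6, residual)         # 0: the component adds nothing
print(round(q_t, 4), q_none)
```

Population and sample variance give the same ratio here because both series have the same length, so the choice of `pvariance` does not affect the result.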

To further discern the LLMs’ preferences for specific segments of the input data, we add Gaussian noise to the original time series to create counterfactual examples. We define a sliding window that covers 10% of the total length of the time series and move it gradually closer to the output sequence. This method allows us to assess the impact of different segments fairly and thereby infer which segments of the time series LLMs predominantly focus on.

3.2 Preferences for Input Sequences

In this subsection, we investigate the input sequence preferences for time series forecasting with LLMs. We conduct experiments on real datasets with GPT-3.5-turbo and GPT-4-turbo, measuring model performance through MAPE. To further validate our findings, we also use GPT-3.5-turbo and Gemini-1.0-Pro to forecast multiple-period time series on synthesized datasets.

Table 1: Correlation matrix between the strengths of the input time series and the model performance.

| Metrics | GPT4-MAPE | GPT3.5-MAPE | Trend Strength $Q_T$ | Seasonal Strength $Q_S$ |
|---|---|---|---|---|
| GPT4-MAPE | 1.00000 | 0.987398 | -0.020637 | -0.681440 |
| GPT3.5-MAPE | 0.987398 | 1.00000 | -0.115087 | -0.669983 |
| Trend Strength $Q_T$ | -0.020637 | -0.115087 | 1.00000 | 0.508980 |
| Seasonal Strength $Q_S$ | -0.681440 | -0.669983 | 0.508980 | 1.00000 |

3.2.1 Implementation Details

Real Datasets:  We conduct experiments on ten real-world datasets, including both those with clear patterns and those with irregular characteristics. The results are shown in Table 6. We apply Seasonal-Trend decomposition using LOESS (STL) [5] to decompose the original time series into trend, seasonal, and residual components. Subsequently, we compute the trend strength $Q_T$ and the seasonal strength $Q_S$. To further understand the LLMs’ preferences for specific segments of the input data, we conduct a counterfactual analysis with a systematic perturbation of the input time series. We first scale the sequence through min-max normalization. We then define a sliding window that covers 10% of the total length of the time series and add Gaussian noise to the data within this window. Subsequently, the sliding window moves closer to the last known data point.

Refer to subsubsection A.2.1 for detailed information.
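The counterfactual procedure for real datasets can be sketched as follows. The min-max scaling and the 10% window come from the description above; the noise standard deviation (`sigma=0.1`), the non-overlapping stride, and all function names are assumptions made for this illustration.

```python
import random


def min_max_scale(series):
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in series]


def perturb_window(series, start, window_frac=0.10, sigma=0.1, rng=None):
    """Add Gaussian noise only inside the window beginning at index `start`."""
    rng = rng or random.Random(0)
    width = max(1, round(window_frac * len(series)))
    noisy = list(series)
    for i in range(start, min(start + width, len(noisy))):
        noisy[i] += rng.gauss(0.0, sigma)
    return noisy


def sliding_counterfactuals(series, window_frac=0.10, sigma=0.1, seed=0):
    """One perturbed copy per window position, moving toward the end of the input."""
    scaled = min_max_scale(series)
    width = max(1, round(window_frac * len(scaled)))
    rng = random.Random(seed)
    return [(start, perturb_window(scaled, start, window_frac, sigma, rng))
            for start in range(0, len(scaled) - width + 1, width)]


variants = sliding_counterfactuals(list(range(20)))
print(len(variants), variants[-1][0])
```

Each perturbed copy would then be serialized and sent to the model, and the resulting MAPE plotted against the window position.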

Synthesized Datasets:  To further validate our findings and investigate the influence of the number of periods on model performance, we generate a dataset using the function $y=\alpha x+\beta_{1}\cos(2\pi f_{1}x)+\beta_{2}\cos(2\pi f_{2}x)+\epsilon$, where $x$ ranges from 0 to 20 and $\epsilon$ follows the normal distribution $\mathcal{N}(0,1)$. Refer to subsubsection A.2.1 for detailed information.
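A generator for such a synthetic series might look like the sketch below. Only the functional form, the range of $x$, and the unit-variance noise come from the text; the number of points and the values of `alpha`, `betas`, and `freqs` are placeholders chosen for this example, not the settings used in the experiments.

```python
import math
import random


def synthesize(n=200, alpha=0.5, betas=(2.0, 1.0), freqs=(1.0, 0.25),
               noise_sd=1.0, seed=0):
    """y = alpha*x + sum_j beta_j*cos(2*pi*f_j*x) + eps, with x evenly spaced on [0, 20]."""
    rng = random.Random(seed)
    xs = [20.0 * i / (n - 1) for i in range(n)]
    ys = [alpha * x
          + sum(b * math.cos(2 * math.pi * f * x) for b, f in zip(betas, freqs))
          + rng.gauss(0.0, noise_sd)
          for x in xs]
    return xs, ys


xs, ys = synthesize()
print(len(xs), xs[0], xs[-1])
```

Passing longer `betas` and `freqs` tuples adds further periodic terms, which is one way to vary the number of periods while keeping the trend and noise fixed.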

3.2.2 Key Findings

After computing the Pearson correlation coefficients (PCC), we observe a moderately strong negative correlation between seasonal strength and MAPE (about -0.68 for GPT-4-turbo and -0.67 for GPT-3.5-turbo) and a much weaker negative correlation between trend strength and MAPE (-0.02 and -0.12), as shown in Table 1. Since lower MAPE means better forecasts, LLMs perform better when the input time series has higher strengths, most clearly for seasonal strength. In the context of multi-period time series, model performance worsens as the number of periods increases. This indicates that LLMs may have difficulty recognizing the multiple periods inherent in such datasets, potentially because they cannot adequately capture long-term periods, as supported by Figure 8. For the counterfactual analysis, as shown in Figure 2 and Figure 3, there is a noticeable increase in MAPE when Gaussian noise is added to the later segments, while perturbing the first part of the sequence has little effect on prediction performance. Our findings reveal that LLMs are more sensitive to the end of the input time series when forecasting. We show our full results in Figure 6 and Figure 7, where moving to the right along the x-axis corresponds to windows closer to the output sequence. We defer the details to Appendix A.

It is also found that the initial part of the sequence has the least impact on the prediction accuracy. For datasets with seasonal strengths over 85%, such as WoolyDataset and MonthlyMilkDataset, more than 80% of the length of the time series has almost no effect on model performance.
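For readers who want to reproduce a correlation analysis of this kind, the Pearson coefficient between a strength index and an error metric can be computed as below. The function is a plain implementation of the standard formula; the input values are made up for illustration and are not the measurements behind Table 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical per-dataset values: seasonal strength versus MAPE.
seasonal_strength = [0.95, 0.90, 0.60, 0.30, 0.10]
mape_values = [0.05, 0.08, 0.20, 0.35, 0.50]
print(round(pearson(seasonal_strength, mape_values), 3))
```

A coefficient near -1 in such an analysis means that error falls as strength rises, which is the direction of the seasonal-strength entries in Table 1.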

4 Why do LLMs Forecast Well on Data with Higher Seasonal Strengths?

Our findings show that LLMs demonstrate enhanced performance in time series forecasting with strong seasonal strengths. This raises the question: Why do LLMs perform well in forecasting datasets with marked seasonal patterns? To explore this phenomenon, we craft prompts that require LLMs to recognize the dataset’s temporal pattern.

This approach is grounded in the hypothesis that LLMs’ proficiency in handling datasets with distinct seasonal attributes stems from their ability to recognize the underlying period. By explicitly prompting LLMs to predict the dataset’s period, we test whether they can discern and extrapolate from such patterns, which sheds light on the mechanisms that underpin their superior performance in such contexts.

4.1 Implementation Details

To examine why LLMs forecast well on datasets with higher seasonal strengths, we design the following experiment. We tokenize the input sequence and let the LLMs output the period directly. We use GPT-3.5-turbo, GPT-4-turbo and Gemini-1.0-Pro to predict the periods. We choose five datasets whose seasonal strengths exceed 85%. These datasets are readily available and have clear seasonal patterns. In contrast, determining the specific periods of other, irregular datasets is challenging, as they have no specific cycles. For each dataset, we query the model ten times, record the predicted periods, and identify the mode, i.e., the most frequently predicted value. We then compare the mode of these ten results with the real period. We select the mode as the evaluation metric because, given how LLMs are typically used, the most frequent output best represents the model’s normal performance. We defer the details of the prompt to subsection A.3, and the results are shown in Table 2.
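The aggregation step is simple to express in code. The sketch below takes the ten predicted periods from repeated runs and returns their mode; the function name is chosen for this example, and the two input lists are the GPT-3.5-turbo rows for AirPassengersDataset and WoolyDataset in Table 2.

```python
from collections import Counter


def mode_period(predictions):
    """Most frequent predicted period; ties resolve to the value seen first."""
    return Counter(predictions).most_common(1)[0][0]


air_passengers_runs = [24, 24, 7, 24, 12, 24, 11, 24, 24, 24]
wooly_runs = [4, 3, 4, 3, 3, 4, 3, 3, 6, 3]
print(mode_period(air_passengers_runs), mode_period(wooly_runs))
```

Using the mode rather than the mean keeps a single outlying answer, such as the 7 in the first list, from shifting the reported period.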

4.2 Key Findings

According to the results, we find that large language models can mostly determine the periodicity of a dataset. The true periods are determined here by the periodogram, which is commonly used to identify the dominant periods [1]. A predicted period that is a multiple of the true period also aligns with the original data cycle; consequently, we consider such predictions to be accurate. We observe that LLMs generally perform well in predicting the period for most datasets, with minimal fluctuations across runs. Surprisingly, for WoolyDataset and AusBeerDataset, which have relatively short underlying periods, the mode of the periods predicted by GPT-3.5-turbo is 3 instead of the true period, 4. This discrepancy may be attributed to the LLMs’ tendency to focus on cyclic patterns among individual digits rather than considering the entire sequence as a whole, a phenomenon that could also be interpreted as the model’s identification of the underlying cycle. We leave a comprehensive analysis of this phenomenon to future work.
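To make the reference procedure concrete, the sketch below estimates a dominant period with a plain discrete-Fourier-transform periodogram and applies the multiple-of-the-true-period rule described above. It is a simplified stand-in: it removes only the mean (a strong trend would need to be removed first), and both function names are hypothetical.

```python
import cmath
import math


def dominant_period(series):
    """Period (in samples) of the frequency with the largest periodogram power."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centered))
        power = abs(coef) ** 2 / n
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


def counts_as_accurate(predicted, true_period):
    """A prediction is accepted if it is a positive multiple of the true period."""
    return predicted > 0 and predicted % true_period == 0


monthly = [math.cos(2 * math.pi * t / 12) for t in range(48)]
print(dominant_period(monthly), counts_as_accurate(24, 12), counts_as_accurate(3, 4))
```

Under this rule a predicted period of 24 for a monthly series with true period 12 is accepted, while a prediction of 3 for a true period of 4 is not.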

Figure 2: Experiments of Sequence Focused Attention Through Counterfactual Explanation on GPT-3.5-turbo. Panels: (a) TSMCStock, (b) IstanbulTrafficDataset, (c) MonthlyMilk.

Figure 3: Experiments of Sequence Focused Attention Through Counterfactual Explanation on Gemini-Pro-1.0. Panels: (a) TSMCStock, (b) IstanbulTrafficDataset, (c) MonthlyMilk.

5 How to Leverage These Insights to Improve the Model’s Performance?

Based on the findings in the previous two sections, our focus is now on how to leverage these findings to further improve model performance. We propose two approaches to the user input, neither of which requires additional fine-tuning. For the input prompt, we incorporate external knowledge about the dataset, such as its source and collection background, which helps the model grasp the underlying trend and seasonal patterns. For the input sequence, we transform the time series data into a format resembling natural language rather than relying on the original tokenization, which leverages LLMs’ superior capabilities with language sequences. Both methods achieve substantially improved model performance.

5.1 External Knowledge Enhancing Time Series Forecasting

We introduce a novel method to improve the performance of large language models for time series forecasting. The core idea of this part is to use the knowledge obtained from the pre-training stage to help predict. We provide the large language model with some basic information about the current dataset such as the background of the data collection, and this process does not involve data leakage. We incorporate our tests on the data leakage in Appendix A.4. It is noted that we do not provide the LLMs with any statistical information such as the periods or trends. This approach ensures that the LLMs forecast the time series entirely based on the data and their prior knowledge. Let $V_{s}$ denote the initial prompt representing the original time sequence, and let $z$ denote the additional information. Consequently, the new prompt $V_{e}$ can be expressed as: $V_{e}=z+V_{s}$.

5.1.1 Implementation Details

We input the dataset’s external knowledge through prompts before the sequence’s input. The external knowledge of each dataset is presented in subsection A.1. The results are shown in Table 8, where LLMTime Prediction refers to the approach described by [10] without any modifications.
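The prompt construction $V_{e}=z+V_{s}$ amounts to string concatenation with the background text placed first. The sketch below illustrates this; the function name, the separator, and the wording of the example background sentence are hypothetical and are not the prompts listed in subsection A.1.

```python
def build_enhanced_prompt(knowledge, sequence_prompt):
    """V_e = z + V_s: external knowledge first, then the original sequence prompt."""
    return knowledge.strip() + "\n" + sequence_prompt.strip()


# Hypothetical background text; it describes the data source only and
# deliberately contains no statistics such as the period or the trend.
z = "This series records the monthly number of international airline passengers."
v_s = "1 1 2, 1 1 8, 1 3 2, 1 2 9"
print(build_enhanced_prompt(z, v_s))
```

Keeping $z$ free of explicit periods or trends, as the section requires, means any gain has to come from what the model already knows about that kind of data.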

5.1.2 Key Findings

As shown in Table 8, this method achieves improved performance in most scenarios. In addition, GPT-4-turbo generally performs better than GPT-3.5-turbo on MSE, MAE, and MAPE, especially on AirPassengers, AusBeer, and other datasets. Llama-2 significantly outperforms GPT-3.5-turbo and GPT-4-turbo in terms of MSE and MAE on some datasets (e.g., Wooly, ETTh1, ETTm2), indicating that it can capture data features more accurately. With external knowledge enhancement, Gemini outperforms the other models on the MonthlyMilk, Sunspots, Wooly, and HeartRate datasets, but performs poorly on the remaining datasets.

5.2 Natural Language Paraphrasing

In this subsection, we conduct experiments on the natural language paraphrasing of the input time sequences. This strategy capitalizes on the advanced abilities of large language models in handling language sequences. It is motivated by the fact that LLMs are insensitive to the order of magnitude and size of digits [18].

We use natural language to describe the trend between consecutive values. For instance, given a time series $X=[X_{1},X_{2},X_{3},\ldots,X_{n}]$, we describe the trend from $X_{t}$ to $X_{t+1}$ as follows: "The value rises from $X_{t}$ to $X_{t+1}$, and falls from $X_{t+1}$ to $X_{t+2}$…". The string we get here is our natural language paraphrasing sequence. After generating responses based on the string, we extract the values from the text and construct the predicted time series.
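Both directions of this conversion can be sketched briefly. The sentence template follows the example above; the wording used for equal consecutive values ("stays flat"), the regular expression, and the function names are assumptions of this sketch rather than the exact procedure used in the experiments.

```python
import re


def paraphrase(values):
    """Describe each consecutive pair of values as a rise, a fall, or no change."""
    if len(values) < 2:
        raise ValueError("need at least two values to describe a trend")
    steps = []
    for prev, curr in zip(values, values[1:]):
        verb = "rises" if curr > prev else "falls" if curr < prev else "stays flat"
        steps.append(f"{verb} from {prev} to {curr}")
    return "The value " + ", and ".join(steps) + "."


def extract_values(text):
    """Recover the numeric series from text written in the template above."""
    number = r"(-?\d+(?:\.\d+)?)"
    pairs = re.findall(rf"from {number} to {number}", text)
    if not pairs:
        return []
    return [float(pairs[0][0])] + [float(end) for _, end in pairs]


sentence = paraphrase([112, 118, 132, 129])
print(sentence)
print(extract_values(sentence))
```

The same `extract_values` routine would be applied to the model's continuation to build the predicted series, assuming the model keeps to the template.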

Table 2: Results and comparison of time series period prediction based on GPT-3.5-turbo, Gemini-Pro-1.0, and GPT-4-turbo. Each row lists the ten predicted periods, followed by the real period and the mode of the predictions.

| Model | Dataset | Predicted periods (10 runs) | Real | Mode |
|---|---|---|---|---|
| GPT-3.5-turbo | AirPassengersDataset | 24 24 7 24 12 24 11 24 24 24 | 12 | 24 |
| | WineDataset | 11 12 24 24 24 20 24 24 24 24 | 12 | 24 |
| | MonthlyMilkDataset | 6 9 12 9 12 12 12 12 12 11 | 12 | 12 |
| | WoolyDataset | 4 3 4 3 3 4 3 3 6 3 | 4 | 3 |
| | AusBeerDataset | 3 3 3 3 3 3 3 3 3 3 | 4 | 3 |
| Gemini-Pro-1.0 | AirPassengersDataset | 11 12 12 4 12 12 12 12 12 12 | 12 | 12 |
| | WineDataset | 10 12 24 12 6 12 12 24 12 12 | 12 | 12 |
| | MonthlyMilkDataset | 16 12 12 12 12 39 12 11 12 12 | 12 | 12 |
| | WoolyDataset | 5 7 4 5 4 4 4 4 5 6 | 4 | 4 |
| | AusBeerDataset | 4 4 4 2 5 5 4 3 5 7 | 4 | 4 |
| GPT-4-turbo | AirPassengersDataset | 12 12 12 12 12 12 12 12 12 12 | 12 | 12 |
| | WineDataset | 7 6 6 6 7 7 6 6 6 6 | 12 | 6 |
| | MonthlyMilkDataset | 10 12 12 12 12 14 12 12 12 12 | 12 | 12 |
| | WoolyDataset | 5 5 7 5 5 5 7 5 5 4 | 4 | 5 |
| | AusBeerDataset | 4 4 4 4 6 4 4 4 6 4 | 4 | 4 |

5.2.1 Implementation Details

We use GPT-3.5-Turbo, GPT-4-turbo, Llama-2 and Gemini-Pro-1.0 to forecast the time series. Due to the page limit, only part of the results are presented in Table 3; we defer the full results (Table 7) to the Appendix.

5.2.2 Key Findings

According to the results in Table 3, we find that natural language paraphrasing improves the time series forecasting performance of LLMs on most datasets. For instance, GPT-3.5-turbo and GPT-4-turbo achieve lower errors with natural language paraphrasing than with LLMTime prediction on most datasets. With natural language paraphrasing, Gemini outperforms the other LLMs on the Wooly and AusBeer datasets but underperforms on the others. Overall, these results demonstrate the effectiveness of our methods.

Table 3: The results of natural language paraphrasing of sequences and baseline comparison (partial).
Columns: Dataset; MSE, MAE, MAPE for Natural Language Paraphrasing; MSE, MAE, MAPE for LLMTime Prediction.
Model: GPT-3.5-Turbo (GPT-3.5-turbo-1106)
AirPassengers 267.66 3.66 0.99 6244.07 61.39 14.43
AusBeer 598.45 5.81 1.36 841.68 23.59 5.62
GasRateCO2 3.16 0.46 0.85 10.88 2.66 4.73
MonthlyMilk 968.69 8.61 1.02 7507.13 66.28 112.77
Sunspots 251.61 4.27 20.42 6556.55 58.95 217.94
HeartRate 4.38 0.55 0.57 76.83 7.15 7.42
Istanbul-Traffic 224.17 3.74 8.81 335.05 6.75 11.68
ETTh1 1.21 0.48 54.17 5.64 2.71 1.625
ETTm2 0.81 0.36 27.33 3.46 2.17 1.178
Model: GPT-4-Turbo (GPT-4-turbo-preview)
AirPassengers 133.10 2.87 0.80 1286.25 28.04 6.07
AusBeer 661.80 7.24 1.63 513.49 18.57 4.28
GasRateCO2 2.28 0.41 0.75 7.27 2.32 4.18
MonthlyMilk 413.63 4.94 0.57 4442.18 50.75 172.82
Sunspots 194.52 5.30 16.10 3374.70 41.87 321.11
HeartRate 11.64 1.21 1.30 988.14 26.57 29.22
Istanbul-Traffic 176.91 3.88 9.67 195.33 5.53 10.03
ETTh1 1.20 0.49 47.62 4.73 1.53 3.282
ETTm2 0.45 0.27 23.62 2.30 1.034 1.607
Model: Llama-2 (llama-2-13B)
AirPassengers 751.34 6.77 1.53 1317.9 55.49 11.18
AusBeer 591.75 23.25 5.41 644.82 17.88 4.08
GasRateCO2 10.16 2.89 5.16 12.78 2.97 5.47
MonthlyMilk 851.17 84.83 9.46 3410.20 41.40 240.25
Sunspots 1483.29 33.27 17.79 4467.67 48.95 91.79
HeartRate 49.8 5.84 6.53 75.58 7.11 7.94
Istanbul-Traffic 306.80 5.39 7.24 438.28 7.28 9.81
ETTh1 1.47 0.87 58.34 4.84 1.79 3.178
ETTm2 0.84 0.41 29.86 3.31 2.07 2.153
Model: Gemini-Pro-1.0 (gemini-1.0-pro)
AirPassengers 4474.54 31.54 7.02 6392.21 63.57 14.03
AusBeer 278.45 10.05 2.29 397.78 14.36 3.27
GasRateCO2 13.29 2.50 4.38 18.99 3.57 6.46
MonthlyMilk 440.29 11.91 1.39 628.98 17.01 1.99
Sunspots 438.29 10.47 1.21 626.03 14.94 1.73
HeartRate 40.57 4.20 4.67 57.96 6.01 6.66
Istanbul-Traffic 267.43 5.69 8.37 321.56 7.32 9.71
ETTh1 1.17 0.74 54.86 4.84 1.79 3.178
ETTm2 0.88 0.39 21.82 3.31 2.07 2.153
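
For reference, the three error metrics reported in Table 3 can be computed as in the following minimal sketch. It assumes that MAPE is expressed in percent, which this section does not state explicitly; the function names are illustrative and are not taken from the authors' code.

```python
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing any metric."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error."""
    _check(actual, predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (assumed convention)."""
    _check(actual, predicted)
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

Because MAPE divides by the true value, it can become very large on series whose values approach zero, which is one possible reason for the large MAPE entries on some datasets.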

6 Related Work

In this section, we review two lines of research that are most relevant to ours.

6.1 Traditional Time Series Forecasting

Two commonly used methods for traditional time series analysis are the ARIMA method [2] and the exponential smoothing method [7]. The ARIMA model is a classic forecasting method that breaks down a time series into auto-regressive (AR), integrated (I, i.e., differencing), and moving average (MA) components to make predictions. Exponential smoothing, on the other hand, is a straightforward yet effective technique that forecasts future values by taking a weighted average of past observations. The ARIMA model requires testing the stationarity of the data and selecting the right order. Exponential smoothing, although it is not affected by outliers, is only suitable for stationary time series, and its accuracy in predicting future values is lower than that of the ARIMA model.
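
To make the "weighted average of past observations" concrete, the following is a minimal sketch of simple exponential smoothing, the most basic variant with no trend or seasonal terms. The function name and the smoothing factor `alpha` are illustrative assumptions, not the configuration used in our experiments.

```python
from typing import Sequence


def simple_exponential_smoothing(series: Sequence[float], alpha: float) -> float:
    """Return the one-step-ahead forecast of simple exponential smoothing.

    The level is updated as level = alpha * y + (1 - alpha) * level, so the
    weight of an observation decays geometrically with its age.
    """
    if len(series) == 0:
        raise ValueError("series must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    for y in series[1:]:
        level = alpha * y + (1.0 - alpha) * level
    return level
```

With `alpha = 1` the forecast reduces to the last observation (a naive forecast), while smaller values of `alpha` average over a longer history.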

6.2 LLMs for Time Series Forecasting

The first family of methods involves either pre-training a foundational large language model or fine-tuning existing LLMs on extensive time-series data [17, 8, 6, 3]. For instance, [17] aimed to build foundational models for time series and investigate their scaling behavior. [4] proposed a two-stage fine-tuning strategy for handling multivariate time-series forecasting. Although these studies contribute significantly to understanding foundational models, they require considerable computing resources and expertise in fine-tuning procedures. Moreover, the details of a model may not be disclosed for commercial reasons [8], which impedes future research. Additionally, in scenarios with limited data, there is insufficient information for training or fine-tuning.

In contrast, the second family of methods does not involve fine-tuning model parameters. These methods either create appropriate prompts or reprogram inputs to handle time series data effectively [10, 19, 14, 21]. [19] tokenizes the time series and embeds the resulting tokens, and [14] reprograms the time series data with text prototypes before feeding them to the LLMs. These studies illuminate the characteristics of time series data and devise methods to align them with LLMs. However, they lack an analysis of the abilities and biases of LLMs in forecasting time series. The work most closely related to ours is [10], though it lacks a quantitative analysis of the preferences of LLMs for time series, and it does not explore the impact of input forms and prompt contents, such as converting the numerical time series into natural language sequences and incorporating background information into the prompt. Our work fills this gap, and we expect it to serve as a benchmark for time-series analysis and to provide insights for subsequent research.

7 Conclusions and Future Work

In this work, we investigate the key preferences of LLMs in the domain of time series forecasting under the zero-shot setting, revealing a proclivity for data with distinct trends and seasonal patterns. Through a blend of real and synthetic datasets, coupled with counterfactual experiments, we have demonstrated LLMs’ improved forecasting performance on time series that exhibit clear periodicity. In addition, our results indicate that LLMs struggle with multi-period time series datasets, as they have difficulty recognizing the distinct periods within them. Our findings also suggest that large language models are more sensitive to the segment of the input sequence closest to the last known data point than to other segments. Lastly, experimental results indicate that our proposed strategies of incorporating external knowledge and transforming numerical sequences into natural language formats yield substantial improvements in accuracy.

Limitation

This study may be limited in the following ways. First, the limited range of datasets and large language models considered may not capture the full variability of results that a wider array would reveal. In addition, some experiments lack a comparison with hard-coded solutions, leaving a gap in understanding the performance of LLMs relative to traditional programming methods. Furthermore, the datasets are not categorized by type and no type-specific experiments are conducted, which limits insight into the models’ performance in different data domains. These limitations suggest that the results could benefit from more extensive experiments and more nuanced analyses, underscoring the need for further research.

References

  • [1] Maurice S Bartlett. Periodogram analysis and continuous spectra. Biometrika, 37(1/2):1–16, 1950.
  • [2] George EP Box and David A Pierce. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American statistical Association, 1970.
  • [3] Defu Cao, Furong Jia, Sercan O Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, and Yan Liu. Tempo: Prompt-based generative pre-trained transformer for time series forecasting. arXiv preprint arXiv:2310.04948, 2023.
  • [4] Ching Chang, Wei-Yao Wang, Wen-Chih Peng, and Tien-Fu Chen. Llm4ts: Aligning pre-trained llms as data-efficient time-series forecasters, 2024.
  • [5] Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. Stl: A seasonal-trend decomposition. J. Off. Stat, 1990.
  • [6] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688, 2023.
  • [7] Everette S Gardner Jr. Exponential smoothing: The state of the art—part ii. International journal of forecasting, 2006.
  • [8] Azul Garza and Max Mergenthaler-Canseco. Timegpt-1. arXiv preprint arXiv:2310.03589, 2023.
  • [9] Rakshitha W Godahewa, Christoph Bergmeir, Geoffrey Webb, Rob Hyndman, and Pablo Montero-Manso. Monash time series forecasting archive. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
  • [10] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. arXiv preprint arXiv:2310.07820, 2023.
  • [11] Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, Maxime Dumonal, Jan Kościsz, Dennis Bader, Frédérick Gusset, Mounir Benheddi, Camila Williamson, Michal Kosinski, Matej Petrik, and Gaël Grosch. Darts: user-friendly modern machine learning for time series. J. Mach. Learn. Res., 23(1), jan 2022.
  • [12] Allen H Huang, Hui Wang, and Yi Yang. Finbert: A large language model for extracting information from financial text. Contemporary Accounting Research, 2023.
  • [13] Hugging Face. Chapter 6.5 of nlp course. https://huggingface.co/learn/nlp-course/chapter6/5, 2023. Accessed: 2023-02-10.
  • [14] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023.
  • [15] Cristina Ledro, Anna Nosella, and Andrea Vinelli. Artificial intelligence in customer relationship management: literature review and future research directions. Journal of Business & Industrial Marketing, 2022.
  • [16] Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare. NPJ digital medicine, 6(1):210, 2023.
  • [17] Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Vincent Hassen, Anderson Schneider, et al. Lag-llama: Towards foundation models for time series forecasting. arXiv preprint arXiv:2310.08278, 2023.
  • [18] Raj Shah, Vijay Marupudi, Reba Koenen, Khushi Bhardwaj, and Sashank Varma. Numeric magnitude comparison effects in large language models. In Findings of the Association for Computational Linguistics: ACL 2023, 2023.
  • [19] Chenxi Sun, Yaliang Li, Hongyan Li, and Shenda Hong. Test: Text prototype aligned embedding to activate llm’s ability for time series. arXiv preprint arXiv:2308.08241, 2023.
  • [20] Xiaozhe Wang, Kate Smith, and Rob Hyndman. Characteristic-based clustering for time series data. Data mining and knowledge Discovery, 2006.
  • [21] Hao Xue and Flora D Salim. Promptcast: A new prompt-based learning paradigm for time series forecasting. IEEE Transactions on Knowledge and Data Engineering, 2023.
  • [22] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In AAAI 2021, pages 11106–11115. AAAI Press, 2021.

Appendix A

A.1 Dataset description and the External Knowledge incorporated in the Prompts

In this subsection, we briefly introduce the datasets we use, which also serve as the external knowledge incorporated into the prompts. Following [10], we downsample the input series to an hourly frequency, yielding a total of 267 observations and resulting in relatively small datasets. Additionally, we incorporate Memorization datasets published after September 2021, the cutoff date for GPT-3.5-turbo, to demonstrate the effectiveness of TimeLLM and our proposed methods. Finally, we implemented univariate time series forecasting to predict the ’OT’ feature on the ETTh1 and ETTm2 datasets, focusing on the last 96 steps of the test set.

A.1.1 Darts [11]

(1) AirPassengersDataset: This is a series of monthly passenger numbers for international flights, where each value is in thousands of passengers for that month.

(2) AusBeerDataset: This is a quarterly series of beer production, with each value representing the kiloliters of beer produced in that quarter.

(3) GasRateCO2Dataset: This time series dataset describes monthly carbon dioxide emissions.

(4) MonthlyMilkDataset: This time series dataset describes monthly milk production. Each value is the average number of tons of milk each cow produces during the month.

(5) SunspotsDataset: This dataset records the number of sunspots each month, where each value is the number of sunspots in that month.

(6) WineDataset: This is a dataset of monthly wine production in Australia, where each value is the number of wine bottles produced in that month.

(7) WoolyDataset: This is a quarterly series of Australian yarn production, where each value is the number of tons of yarn produced in that quarter.

(8) HeartRateDataset: The series contains 1800 uniformly spaced instantaneous heart rate measurements from a single subject.

(9) ETTh1: This is a time series dataset containing high-frequency energy data of a certain region in China, which is mainly used for energy load forecasting and related time series analysis research.

(10) ETTm2: Similar to ETTh1, ETTm2 is also an energy time series dataset, but the data frequency or region covered may be different; it is also used for the analysis and prediction of energy consumption.

A.1.2 Memorization Datasets [10]

(11) TSMCStockDataset: This is historical trading data about Taiwan Semiconductor Manufacturing Corporation (TSMC) stock, containing information such as share price, volume, and date, and is commonly used in financial analysis and stock market forecasting research.

(12) TurkeyPowerDataset: This is a time series dataset on national electricity consumption in Turkey, which records the electricity usage in Turkey over a period of time and is often used to analyze and forecast electricity demand.

(13) IstanbulTrafficDataset: This dataset offers hourly Traffic Index data for Istanbul from October 2022 to May 2023.

A.1.3 Monash Datasets [9]

(14) US Births Dataset: This dataset contains the number of births in the US from 1969 to 1988.

(15) Saugeen River Flow Dataset: This dataset contains the daily mean flow of the Saugeen River at Walkerton in cubic meters per second from 1915 to 1979.

A.2 Implementation Details

In this subsection, we provide a comprehensive overview of the experiments conducted to investigate the preferences of LLMs for input time series data. We first describe both the real and synthesized datasets we use and then detail the methods we use to investigate the preferences of LLMs.

A.2.1 Real Datasets

We begin by comparing the performance of Large Language Models (LLMs) with traditional time series forecasting methods. The results are depicted in Figures 4 and 5, while Tables 9 and 10 present the computed metrics in tabular format.

To understand the preferences of LLMs, we conducted experiments on ten commonly used datasets: HeartRateDataset, GasRateCO2Dataset, AirPassengersDataset, AusBeerDataset, MonthlyMilkDataset, SunspotsDataset, WineDataset, WoolyDataset, IstanbulTrafficDataset, and TurkeyPowerDataset. We apply Seasonal-Trend decomposition using LOESS (STL) to decompose the original time series into trend, seasonal, and residual components. For most datasets, we obtain the period from the nature of the data. For instance, the number of passengers in AirPassengersDataset is collected monthly, so the period is naturally 12. For datasets without explicit periods, such as the IstanbulTrafficDataset, the period is determined through the periodogram, a widely used signal-processing tool for identifying the period of a time series. The strengths and the model performance are reported in Table 6.
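
As an illustration of periodogram-based period detection, the sketch below computes a plain discrete Fourier transform of the mean-centered series and returns the period of the strongest frequency. It is a simplified stand-in written with the standard library only; the function name is hypothetical, and the actual tooling and any detrending applied in our experiments are not specified here.

```python
import cmath
import math
from typing import Sequence


def dominant_period(series: Sequence[float]) -> float:
    """Estimate the period of a series from the peak of its periodogram."""
    n = len(series)
    if n < 4:
        raise ValueError("series is too short to estimate a period")
    mean = sum(series) / n
    centered = [v - mean for v in series]  # remove the zero-frequency component
    best_k, best_power = 1, -1.0
    # Frequency index k corresponds to k cycles over the whole series.
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power = abs(coeff) ** 2 / n
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k  # period measured in time steps
```

For a series with a strong trend, only removing the mean is usually not enough, and the trend should be removed first (for example by differencing) so that low frequencies do not dominate the spectrum.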

Subsequently, we compute the trend strength $Q_T$ and the seasonal strength $Q_S$ to quantify these components. We then compute the Pearson correlation coefficients (PCC) between every pair of indices (the strengths and the MAPE of each model) and observe a relatively strong correlation between the strengths and model performance, signifying that LLMs perform better when the input time series possesses higher trend and seasonal strength (shown in Table 1). Notably, GPT-4-turbo achieved a higher absolute PCC than GPT-3.5-turbo. This may be attributed to human feedback during GPT-4-turbo training, as individuals may be more aware of seasonal and trend data. This may provide some insights for further research into the characteristics of LLMs in time series forecasting.
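
The correlation analysis can be sketched as follows, using the GPT-4-turbo MAPE values and the strengths transcribed from Table 6. This is only an illustration of the computation: the helper name `pearson` is ours, and the resulting coefficients need not coincide with the values reported in Table 1, which may rely on a different selection of datasets or preprocessing.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Rows of Table 6, in the order they appear there.
gpt4_mape = [6.80, 3.69, 5.12, 334.30, 10.90, 20.41, 47.29, 4.21, 7.90, 3.36]
trend_strength = [1.00, 0.99, 1.00, 0.81, 0.67, 0.96, 0.31, 0.65, 0.42, 0.90]
seasonal_strength = [0.98, 0.96, 0.99, 0.28, 0.92, 0.82, 0.72, 0.50, 0.49, 0.88]

r_trend = pearson(trend_strength, gpt4_mape)
r_seasonal = pearson(seasonal_strength, gpt4_mape)
print(f"PCC(trend strength, MAPE) = {r_trend:.3f}")
print(f"PCC(seasonal strength, MAPE) = {r_seasonal:.3f}")
```

A negative coefficient means that a higher strength goes with a lower MAPE, that is, with better forecasts.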

We also conduct a counterfactual analysis using a systematic perturbation of the input time series. We begin by defining a sliding window whose length equals the period and add Gaussian noise to the data within this window. To reduce costs, we move the window by one period length at a time. This method allows us to assess the importance of the segments that LLMs predominantly focus on. Our observations suggest that introducing noise towards the end of the time series significantly affects the performance of LLMs, leading to the inference that LLMs tend to give more weight to the latter part of the time series in most instances.
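
The windowed perturbation described above can be sketched as follows. The noise standard deviation `noise_std`, the seed, and the function name are assumptions made for illustration; each returned variant would then be forecast by the LLM and compared against the forecast from the unperturbed series.

```python
import random
from typing import List, Sequence, Tuple


def perturb_windows(series: Sequence[float], period: int,
                    noise_std: float, seed: int = 0) -> List[Tuple[int, List[float]]]:
    """Return one noisy copy of the series per non-overlapping window.

    The window has the length of one period and is moved by one period,
    so a series of length n yields n // period perturbed variants.
    """
    if period <= 0 or period > len(series):
        raise ValueError("period must be in 1..len(series)")
    rng = random.Random(seed)
    variants = []
    for start in range(0, len(series) - period + 1, period):
        noisy = list(series)
        for i in range(start, start + period):
            noisy[i] += rng.gauss(0.0, noise_std)  # Gaussian noise inside the window only
        variants.append((start, noisy))
    return variants
```

Comparing the forecast error across window positions then indicates which segment of the input the model relies on most.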

A.2.2 Synthesized Datasets

To investigate the influence of the number of periods on model performance, we generated a dataset using the function $y = \alpha x + \beta_1 \cos(2\pi f_1 x) + \beta_2 \cos(2\pi f_2 x) + \epsilon$, where $\alpha, \beta_1, \beta_2$ denote the coefficients of the trend and seasonal components. We set $\beta_1 = 2$, $\beta_2 \in [1, 3)$, and $\alpha \in [0.2, 0.7]$, uniformly sampled for 10 instances each, and $f_1 = 1$ and $f_2 = 3$ to represent the chosen frequencies. Similar to the previous experiments, $x$ ranges from 0 to 20 and $\epsilon$ follows the normal distribution $\mathcal{N}(0, 1)$. Our results reveal that LLMs exhibit worse performance when input sequences contain multiple periods, even when the seasonal strength is carefully controlled to be nearly unchanged, as shown in Figure 8. This observation may be attributed to the LLMs’ challenge in recognizing and adapting to multiple periods, similar to human behavior.
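
A minimal sketch of this generator is given below. The number of sample points `n_points`, the seeds, and the function names are assumptions, since the sampling density of $x$ is not stated here; "uniformly sampled" is read as independent uniform draws, which is one possible interpretation.

```python
import math
import random
from typing import List, Tuple


def synthesize(alpha: float, beta1: float, beta2: float,
               f1: float = 1.0, f2: float = 3.0,
               n_points: int = 200, x_max: float = 20.0,
               seed: int = 0) -> Tuple[List[float], List[float]]:
    """Generate y = alpha*x + beta1*cos(2*pi*f1*x) + beta2*cos(2*pi*f2*x) + eps."""
    rng = random.Random(seed)
    xs = [x_max * i / (n_points - 1) for i in range(n_points)]  # evenly spaced on [0, x_max]
    ys = [alpha * x
          + beta1 * math.cos(2 * math.pi * f1 * x)
          + beta2 * math.cos(2 * math.pi * f2 * x)
          + rng.gauss(0.0, 1.0)  # eps ~ N(0, 1)
          for x in xs]
    return xs, ys


def sample_parameters(n_instances: int = 10, seed: int = 0) -> List[Tuple[float, float]]:
    """Draw (alpha, beta2) pairs uniformly from the ranges given in the text."""
    rng = random.Random(seed)
    return [(rng.uniform(0.2, 0.7), rng.uniform(1.0, 3.0)) for _ in range(n_instances)]
```

Setting `beta2 = 0` yields the single-period counterpart of the same series, which gives a direct baseline for the multi-period comparison.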

A.3 Detailed Prompts for the Period Prediction

GPT-3.5-turbo and GPT-4-turbo.

     You are a helpful assistant that specializes in time series analysis. The user will provide a sequence. The sequence is separated by commas. You need to infer the most probable underlying cycle of the sequence, even though there may also be a trend in this sequence. Do not assume that the underlying cycle has to stick to the popular cycles like 7 (days in a week), 12 (months in a year) and 30 (days in a month), just infer the cycle fully based on the inherent cycle of the given sequence. The underlying cycle of the sequence is strictly below 8. Please infer the cycle without producing any additional text. Again, the sequence is separated by commas. Sequence:     

Gemini.

     You are a helpful assistant that specializes in time series analysis. The user will provide a sequence. The sequence is represented by decimal strings separated by commas. You need to infer the most probable underlying cycle of the sequence, even though there may also be a trend in this sequence. Do not assume that the underlying cycle has to stick to the popular cycles like 7 (days in a week), 12 (months in a year) and 30 (days in a month), just infer the cycle fully based on the inherent cycle of the given sequence. The underlying cycle of the sequence is strictly below 8. Please infer the cycle without producing any additional text. Sequence:     
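
The prompts above can be combined with a serialized sequence, and the model reply can be parsed, roughly as in the sketch below. The helper names, the number of decimals, and the parsing rule are hypothetical; no particular API client is assumed, so the call to the LLM itself is left out.

```python
import re
from typing import Optional, Sequence


def serialize_sequence(values: Sequence[float], decimals: int = 2) -> str:
    """Render the series as decimal strings separated by commas."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)


def build_period_query(instruction: str, values: Sequence[float]) -> str:
    """Append the serialized series to an instruction that ends with 'Sequence:'."""
    return f"{instruction.strip()} {serialize_sequence(values)}"


def parse_period(reply: str) -> Optional[int]:
    """Extract the first integer from the reply, or None if there is none."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None
```

Returning `None` rather than raising an error makes it easy to count and discard replies that ignore the instruction to answer with the cycle only.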

A.4 Tests on Data Leakage

Since the detailed datasets used for training GPT-4 and Gemini are not directly accessible, we conducted several experiments to indirectly investigate the data leakage issue. First, we performed an "Acknowledge Test" where we asked the LLMs if they recognize the dataset based on its name. Next, we conducted a "Series Test" by asking the LLMs to predict the first 20 steps of the datasets. Finally, we carried out a "Dataset Detection" test, where we fed the first 20 steps into the LLMs and asked them to identify the dataset. All these experiments help us determine whether the LLMs only recognize the name or have detailed knowledge of the datasets.

As shown in Table 4, we observe that both GPT and Gemini are generally aware of the content of most datasets. However, detailed sequence data is primarily known only for the AirPassengers dataset when using GPT, and this level of detail is not known for other datasets. During the experiments on the Dataset Detection, it is found that both GPT and Gemini could identify which dataset is being used based on the first 20 data points of the time series. All these results indicate that while they can recognize and identify datasets from limited information, they typically do not possess detailed knowledge of the sequence data for a broader range of datasets.

Table 4: Summary of tests on different datasets.
Datasets Acknowledge Test (GPT) Acknowledge Test (Gemini) Series Test (GPT) Series Test (Gemini) Dataset Detection (GPT) Dataset Detection (Gemini)
AirPassengers Yes Yes Yes No No No
AusBeer No Yes No No No No
GasRateCO2 No Yes No No No No
MonthlyMilk Yes Yes No No No No
Sunspots Yes Yes No No No No
Wine Yes Yes No No No No
Wooly No No No No No No
HeartRate Yes Yes No No No No

A.5 Computational Cost

We list the average token length associated with external knowledge enhancing and natural language paraphrasing for reference. Avg Token Length (ori) is the prompt length of the original method, without either strategy applied, and Avg Token Length (EKE, NLP) is the prompt length after applying the corresponding strategy. Note that natural language paraphrasing is performed value by value through hard-coded rules. In addition, a length check is performed after the transformation, which guarantees that an output of a fixed length is obtained each time. The results are shown in Table 5.
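
A hard-coded paraphrasing rule with a length check might look like the sketch below. The wording template ("from a increasing to b") and the function name are hypothetical placeholders, not the exact template used in our prompts; the sketch only illustrates the value-by-value rule and the check on the size of the output.

```python
from typing import List, Sequence


def paraphrase(values: Sequence[float]) -> str:
    """Describe each consecutive pair of values with a fixed verbal template."""
    if len(values) < 2:
        raise ValueError("at least two values are required")
    clauses: List[str] = []
    for prev, curr in zip(values, values[1:]):
        if curr > prev:
            verb = "increasing to"
        elif curr < prev:
            verb = "decreasing to"
        else:
            verb = "remaining at"
        clauses.append(f"from {prev} {verb} {curr}")
    # Length check: one clause per transition, so the output size is predictable.
    if len(clauses) != len(values) - 1:
        raise RuntimeError("unexpected number of clauses")
    return ", ".join(clauses)
```

Each value except the first and last is mentioned twice in such a description, which is consistent with the paraphrased prompt being several times longer than the original one in Table 5.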

Table 5: Comparison of Avg Token Lengths among Original TimeLLM method, External Knowledge Enhancing and Natural Language Paraphrasing.
Datasets Avg Token Length (ori) Avg Token Length (EKE) Avg Token Length (NLP)
AirPassengers 200 224 797
AusBeer 200 220 797
GasRateCO2 200 211 797
MonthlyMilk 200 218 797
Sunspots 200 217 797
Wine 200 217 797
Wooly 200 216 797
HeartRate 200 214 797
Table 6: Model performance in the analysis of LLMs’ preferences.
Dataset Name GPT4-MAPE GPT3.5-MAPE Trend Strength Seasonal Strength
AirPassengersDataset 6.80 9.98 1.00 0.98
AusBeerDataset 3.69 5.12 0.99 0.96
MonthlyMilkDataset 5.12 6.25 1.00 0.99
SunspotsDataset 334.30 194.29 0.81 0.28
WineDataset 10.90 14.98 0.67 0.92
WoolyDataset 20.41 19.26 0.96 0.82
IstanbulTrafficDataset 47.29 60.11 0.31 0.72
GasRateCO2Dataset 4.21 5.97 0.65 0.50
HeartRateDataset 7.90 6.75 0.42 0.49
TurkeyPower 3.36 3.52 0.90 0.88

A.6 Figures and Tables

Figure 4: (Left: ARIMA, Center: GPT-3.5-turbo, Right: GPT-4-turbo) The predicted results of the AirPassengers, AusBeerDataset, GasRateCO2, HeartRate, and Istanbul-Traffic datasets.
Figure 5: (Left: ARIMA, Center: GPT-3.5-turbo, Right: GPT-4-turbo) The predicted results of the MonthlyMilk, Sunspots, TSMCStock, TurkeyPower, WineDataset, and Wooly datasets.
Figure 6: Experiments of Sequence Focused Attention Through Counterfactual Explanation on GPT-3.5-turbo. Panels: (a) MonthlyMilk, (b) AusBeerDataset, (c) WineDataset, (d) TurkeyPower, (e) TSMCStock, (f) IstanbulTraffic, (g) GasRateCO2, (h) AirPassengers.
Figure 7: Experiments of Sequence Focused Attention Through Counterfactual Explanation on Gemini-Pro-1.0. Panels: (a) MonthlyMilk, (b) AusBeerDataset, (c) WineDataset, (d) TurkeyPower, (e) TSMCStock, (f) IstanbulTraffic, (g) GasRateCO2, (h) AirPassengers.
Table 7: The results of natural language paraphrasing of sequences and baseline comparison.
Columns: Dataset; MSE, MAE, MAPE for Natural Language Paraphrasing; MSE, MAE, MAPE for LLMTime Prediction.
Model: GPT-3.5-Turbo (GPT-3.5-turbo-1106)
AirPassengers 267.66 3.66 0.99 6244.07 61.39 14.43
AusBeer 598.45 5.81 1.36 841.68 23.59 5.62
GasRateCO2 3.16 0.46 0.85 10.88 2.66 4.73
MonthlyMilk 968.69 8.61 1.02 7507.13 66.28 112.77
Sunspots 251.61 4.27 20.42 6556.55 58.95 217.94
Wine 11403.89 96.95 37.04 30488.60 388.28 15.83
Wooly 12110.16 33.23 4.07 526903.08 574.58 12.00
HeartRate 4.38 0.55 0.57 76.83 7.15 7.42
Istanbul-Traffic 224.17 3.74 8.81 335.05 6.75 11.68
Turkey Power 24382136.98 1843.64 4.68 3882704.14 1315.6 3.58
ETTh1 1.21 0.48 54.17 5.64 2.71 1.625
ETTm2 0.81 0.36 27.33 3.46 2.17 1.178
US Births 926136.72 633.37 7.67 1323052.46 904.81 9.61
Saugeen River Flow 2870.19 18.31 19.43 4100.27 26.16 27.76
Model: GPT-4-Turbo (GPT-4-turbo-preview)
AirPassengers 133.10 2.87 0.80 1286.25 28.04 6.07
AusBeer 661.80 7.24 1.63 513.49 18.57 4.28
GasRateCO2 2.28 0.41 0.75 7.27 2.32 4.18
MonthlyMilk 413.63 4.94 0.57 4442.18 50.75 172.82
Sunspots 194.52 5.30 16.10 3374.70 41.87 321.11
Wine 56138.87 54.67 23.63 22488.17 253.08 9.98
Wooly 18063.64 11.06 25.06 942987.19 871.64 18.55
HeartRate 11.64 1.21 1.30 988.14 26.57 29.22
Istanbul-Traffic 176.91 3.88 9.67 195.33 5.53 10.03
Turkey Power 60601807.53 3118.43 8.03 113873.28 814.46 2.17
ETTh1 1.20 0.49 47.62 4.73 1.53 3.282
ETTm2 0.45 0.27 23.62 2.30 1.034 1.607
US Births 676264.81 501.13 4.81 966092.59 678.55 7.12
Saugeen River Flow 3100.57 19.18 21.86 4190.05 27.41 32.15
Model: Llama-2 (llama-2-13B)
AirPassengers 751.34 6.77 1.53 1317.9 55.49 11.18
AusBeer 591.75 23.25 5.41 644.82 17.88 4.08
GasRateCO2 10.16 2.89 5.16 12.78 2.97 5.47
MonthlyMilk 851.17 84.83 9.46 3410.20 41.40 240.25
Sunspots 1483.29 33.27 17.79 4467.67 48.95 91.79
Wine 102434.52 852.97 34.72 951194.94 240.08 9.45
Wooly 12180.05 83.99 16.92 675062.52 736.04 15.83
HeartRate 49.8 5.84 6.53 75.58 7.11 7.94
Istanbul-Traffic 306.80 5.39 7.24 438.28 7.28 9.81
Turkey Power 3278744.18 2191.34 28.76 2919773.15 1388.10 3.70
ETTh1 1.47 0.87 58.34 4.84 1.79 3.178
ETTm2 0.84 0.41 29.86 3.31 2.07 2.153
US Births - - - - - -
Saugeen River Flow - - - - - -
Model: Gemini-Pro-1.0 (gemini-1.0-pro)
AirPassengers 4474.54 31.54 7.02 6392.21 63.57 14.03
AusBeer 278.45 10.05 2.29 397.78 14.36 3.27
GasRateCO2 13.29 2.50 4.38 18.99 3.57 6.46
MonthlyMilk 440.29 11.91 1.39 628.98 17.01 1.99
Sunspots 438.29 10.47 1.21 626.03 14.94 1.73
Wine 181008.34 2557.62 10.89 258584.78 3645.23 14.60
Wooly 45.44 4.47 4.93 64.92 6.39 7.04
HeartRate 40.57 4.20 4.67 57.96 6.01 6.66
Istanbul-Traffic 267.43 5.69 8.37 321.56 7.32 9.71
Turkey Power 45674.41 2973.54 11.21 103203.37 2195.68 6.07
ETTh1 1.17 0.74 54.86 4.84 1.79 3.178
ETTm2 0.88 0.39 21.82 3.31 2.07 2.153
US Births 467743.19 440.68 4.27 687862.05 595.51 6.28
Saugeen River Flow 2857.23 19.34 18.80 4081.75 25.91 26.87
Table 8: The results of external knowledge enhancement and baseline comparison.
Columns: Dataset; MSE, MAE, MAPE for External Knowledge Enhancing; MSE, MAE, MAPE for LLMTime Prediction.
Model: GPT-3.5-turbo-1106
AirPassengers 3713.99 50.37 10.88 6244.07 61.39 14.43
AusBeer 669.01 21.82 5.12 841.68 23.59 5.62
GasRateCO2 16.47 3.36 5.97 10.88 2.66 4.73
MonthlyMilk 4781.26 55.45 6.25 7507.13 66.28 112.77
Sunspots 7072.42 62.61 194.29 6556.55 58.95 217.94
Wine 24925885.81 3548.19 14.98 30488.60 388.28 15.83
Wooly 955708.49 893.02 19.26 526903.08 574.58 12.00
HeartRate 59.83 6.44 6.75 76.83 7.15 7.42
Istanbul-Traffic 888.31 28.16 60.11 1321.44 48.7 7.47
TSMC-Stock 73.83 7.31 1.54 298.58 15.44 3.23
Turkey Power 2613198.17 1301.83 3.52 3882704.14 1315.6 3.58
ETTh1 2.65 1.01 132.13 5.64 2.71 1.625
ETTm2 2.00 0.89 201.84 3.46 2.17 1.178
Model: GPT-4-turbo-preview
AirPassengers 1262.24 30.54 6.80 1286.25 28.04 6.07
AusBeer 345.59 15.70 3.69 513.49 18.57 4.28
GasRateCO2 6.99 2.29 4.21 7.27 2.32 4.18
MonthlyMilk 2209.33 44.02 5.12 4442.18 50.75 172.82
Sunspots 4571.92 50.24 334.30 3374.70 41.87 321.11
Wine 14426570.88 2734.41 10.90 22488.17 253.08 9.98
Wooly 1078968.96 959.42 20.41 942987.19 871.64 18.55
HeartRate 78.99 6.96 7.90 988.14 26.57 29.22
Istanbul-Traffic 954.88 26.92 47.29 1291.17 32.16 6.46
TSMC-Stock 104.53 8.46 1.79 74.71 6.60 1.39
Turkey Power 3090055.89 1223.78 3.36 113873.28 814.46 2.17
ETTh1 2.70 1.06 129.99 4.73 1.53 3.282
ETTm2 1.18 0.79 291.67 2.30 1.034 1.607
Model: Llama-2 (llama-2-13B)
AirPassengers 3713.99 50.37 10.88 1286.25 28.04 6.07
AusBeer 893.56 21.49 4.87 644.82 17.88 4.08
GasRateCO2 11.38 3.04 5.49 12.78 2.97 5.47
MonthlyMilk 4722.32 60.36 7.05 3410.20 41.40 240.25
Sunspots 4000.19 46.45 138.69 4467.67 48.95 91.79
Wine 8286095.02 2261.30 8.97 951194.94 240.08 9.45
Wooly 389685.08 551.18 11.69 675062.52 736.04 15.83
HeartRate 112.17 7.86 8.93 75.58 7.11 7.94
Istanbul-Traffic 979.15 26.70 45.57 1531.37 34.74 7.42
TSMC-Stock 52105.36 196.02 42.07 2203.97 27.64 27.39
Turkey Power 3416162.71 1547.49 4.09 2919773.15 1388.10 3.70
ETTh1 4.15 1.65 408.11 4.84 1.79 3.178
ETTm2 3.08 1.47 810.56 3.31 2.07 2.153
Model: Gemini-1.0-pro
AirPassengers 5237.85 51.92 11.08 6392.21 63.57 14.03
AusBeer 325.45 10.84 1.86 397.78 14.36 3.27
GasRateCO2 15.54 3.23 4.43 18.99 3.57 6.46
MonthlyMilk 491.26 15.18 1.13 628.98 17.01 1.99
Sunspots 491.64 11.15 1.27 626.03 14.94 1.73
Wine 210818.24 3230.41 8.35 258584.78 3645.23 14.60
Wooly 51.04 5.70 7.93 64.92 6.39 7.04
HeartRate 47.45 4.83 4.67 57.96 6.01 6.66
Istanbul-Traffic 1253.74 28.25 5.42 1531.37 34.74 7.42
TSMC-Stock 153.73 5.02 1.05 188.18 6.67 1.65
Turkey Power 83522.95 1812.31 5.51 103203.37 2195.68 6.07
ETTh1 2.92 1.45 2.88 4.84 1.79 3.178
ETTm2 2.00 1.74 1.22 3.31 2.07 2.153
Figure 8: Results on the multiple periods within the Synthesized Dataset. Panels: (a) GPT-3.5-turbo, (b) Gemini.
Table 9: Comparison test of traditional prediction methods (Part I).
| Dataset | Method | MSE | MAE | MAPE |
|---|---|---|---|---|
| AirPassengers | Exponential Smoothing | 2007.67 | 37.91 | 8.10 |
| | SARIMA | 2320.47 | 39.80 | 8.46 |
| | Cyclical Regression | 2028.37 | 36.70 | 8.52 |
| | AutoARIMA | 8702.09 | 68.52 | 13.98 |
| | FFT | 3274.46 | 46.38 | 10.59 |
| | StatsForecastAutoARIMA | 2952.52 | 45.41 | 9.71 |
| | Naive Mean | 47703.65 | 204.25 | 44.61 |
| | Naive Seasonal | 6032.80 | 62.87 | 14.18 |
| | Naive Drift | 6505.79 | 72.21 | 17.50 |
| | Naive Moving Average | 6032.80 | 62.87 | 14.18 |
| | N-Beats | 3994.55 | 54.95 | 12.81 |
| | DeepAR | 184222.64 | 421.99 | 98.42 |
| | Prophet | 7345.31 | 43.87 | 8.62 |
| | LLMTime with GPT-3.5-Turbo | 6244.07 | 61.39 | 14.43 |
| | LLMTime with GPT-4-Turbo | 1317.9 | 55.49 | 11.18 |
| | LLMTime with Gemini-1.0-pro | 6392.21 | 63.57 | 14.03 |
| | LLMTime with Llama-2 | 1286.25 | 28.04 | 6.07 |
| AusBeer | Exponential Smoothing | 703.26 | 22.80 | 5.44 |
| | SARIMA | 475.53 | 19.07 | 4.49 |
| | Cyclical Regression | 989.31 | 26.29 | 6.13 |
| | AutoARIMA | 550.05 | 18.84 | 4.41 |
| | FFT | 7682.56 | 73.74 | 17.44 |
| | StatsForecastAutoARIMA | 559.46 | 20.56 | 4.86 |
| | Naive Mean | 1885.72 | 30.66 | 6.68 |
| | Naive Seasonal | 10828.02 | 96.35 | 23.39 |
| | Naive Drift | 18507.61 | 128.23 | 30.91 |
| | Naive Moving Average | 10828.02 | 96.35 | 23.39 |
| | N-Beats | 250.61 | 14.42 | 3.53 |
| | DeepAR | 16197.17 | 40.23 | 9.89 |
| | Prophet | 6323.89 | 28.76 | 6.92 |
| | LLMTime with GPT-3.5-Turbo | 841.68 | 23.59 | 5.62 |
| | LLMTime with GPT-4-Turbo | 513.49 | 18.57 | 4.28 |
| | LLMTime with Gemini-1.0-pro | 397.78 | 14.36 | 3.27 |
| | LLMTime with Llama-2 | 644.82 | 17.88 | 4.08 |
| MonthlyMilk | Exponential Smoothing | 564.94 | 20.23 | 2.41 |
| | SARIMA | 1289.76 | 32.78 | 3.87 |
| | Cyclical Regression | 3631.53 | 56.15 | 6.60 |
| | AutoARIMA | 2682.67 | 42.82 | 5.20 |
| | FFT | 3453.96 | 45.62 | 5.48 |
| | StatsForecastAutoARIMA | 186.14 | 10.64 | 1.28 |
| | Naive Mean | 19893.07 | 127.33 | 14.46 |
| | Naive Seasonal | 4870.40 | 56.00 | 6.31 |
| | Naive Drift | 3998.11 | 56.06 | 6.52 |
| | Naive Moving Average | 4870.40 | 56.00 | 6.31 |
| | N-Beats | 3140.89 | 51.57 | 6.07 |
| | DeepAR | 728289.50 | 851.30 | 99.22 |
| | Prophet | 663.41 | 25.76 | 2.92 |
| | LLMTime with GPT-3.5-Turbo | 7507.13 | 66.28 | 112.77 |
| | LLMTime with GPT-4-Turbo | 4442.18 | 50.75 | 172.82 |
| | LLMTime with Gemini-1.0-pro | 628.98 | 17.01 | 1.99 |
| | LLMTime with Llama-2 | 3410.20 | 41.40 | 240.25 |
| Sunspots | Moving Average | 326750.49 | 499.78 | 3129.63 |
| | Exponential Smoothing | 326750.49 | 499.78 | 3129.63 |
| | SARIMA | 2902.72 | 45.75 | 466.99 |
| | Cyclical Regression | 3917.76 | 47.84 | 274.31 |
| | AutoARIMA | 4695.67 | 58.47 | 709.23 |
| | FFT | 3784.56 | 49.81 | 150.32 |
| | StatsForecastAutoARIMA | 8406.55 | 72.99 | 95.18 |
| | Naive Mean | 4120.40 | 49.84 | 267.22 |
| | Naive Seasonal | 4440.63 | 56.78 | 688.58 |
| | Naive Drift | 5032.77 | 60.40 | 724.88 |
| | Naive Moving Average | 4440.63 | 56.78 | 688.58 |
| | N-Beats | 4877.59 | 56.58 | 105.55 |
| | DeepAR | 3421.02 | 48.93 | 132.76 |
| | Prophet | 6303.57 | 76.83 | 67.97 |
| | LLMTime with GPT-3.5-Turbo | 6556.55 | 58.95 | 217.94 |
| | LLMTime with GPT-4-Turbo | 3374.70 | 41.87 | 321.11 |
| | LLMTime with Gemini-1.0-pro | 626.03 | 14.94 | 1.73 |
| | LLMTime with Llama-2 | 4467.67 | 48.95 | 91.79 |
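The naive baselines listed in the table can be illustrated with a short sketch. The paper does not state how these baselines were implemented, so the code below uses common textbook definitions as an assumption: the function names, the `period` argument, and the toy series are all hypothetical and are not taken from the experiments.

```python
import numpy as np

def naive_mean(train, horizon):
    """Forecast every future step as the mean of the training series."""
    return np.full(horizon, np.mean(train))

def naive_seasonal(train, horizon, period):
    """Repeat the last observed cycle of length `period` (assumed definition)."""
    train = np.asarray(train, dtype=float)
    last_cycle = train[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]

def naive_drift(train, horizon):
    """Extend the straight line through the first and last training points."""
    train = np.asarray(train, dtype=float)
    slope = (train[-1] - train[0]) / (len(train) - 1)
    steps = np.arange(1, horizon + 1)
    return train[-1] + slope * steps

# Toy series, for illustration only (not one of the benchmark datasets).
train = np.array([10.0, 12.0, 14.0, 11.0, 13.0, 15.0])
print(naive_mean(train, 3))
print(naive_seasonal(train, 4, period=3))
print(naive_drift(train, 2))
```

Baselines of this kind need no fitting, which is why they serve as a floor when reading the error columns for the learned and LLM-based methods.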
Table 10: Comparison test of traditional prediction methods (Part II).
| Dataset | Method | MSE | MAE | MAPE |
|---|---|---|---|---|
| WineDataset | Exponential Smoothing | 23709576.52 | 3370.78 | 14.23 |
| | SARIMA | 1150166.94 | 966.57 | 20.76 |
| | Cyclical Regression | 7873785.27 | 2148.24 | 8.52 |
| | AutoARIMA | 698661.90 | 646.03 | 14.07 |
| | FFT | 1031170.45 | 867.83 | 18.60 |
| | StatsForecastAutoARIMA | 20040877.37 | 2853.17 | 12.05 |
| | Naive Mean | 11557786.19 | 2200.04 | 8.80 |
| | Naive Seasonal | 879447.22 | 724.23 | 15.52 |
| | Naive Drift | 9609576.04 | 1833.38 | 7.36 |
| | Naive Moving Average | 9070696.99 | 1719.17 | 6.90 |
| | N-Beats | 5418377.00 | 1887.30 | 7.68 |
| | DeepAR | 715027008.00 | 26236.14 | 89.91 |
| | Prophet | 4846922.27 | 2201.57 | 8.27 |
| | LLMTime with GPT-3.5-Turbo | 30488.60 | 388.28 | 15.83 |
| | LLMTime with GPT-4-Turbo | 22488.17 | 253.08 | 9.98 |
| | LLMTime with Gemini-1.0-pro | 258584.78 | 3645.23 | 14.60 |
| | LLMTime with Llama-2 | 951194.94 | 240.08 | 9.45 |
| WoolyDataset | Exponential Smoothing | 24925885.81 | 3548.19 | 14.98 |
| | SARIMA | 812352.21 | 759.07 | 16.37 |
| | Cyclical Regression | 1032574.82 | 962.72 | 22.14 |
| | AutoARIMA | 838852.91 | 786.25 | 16.84 |
| | FFT | 1012255.35 | 945.20 | 20.80 |
| | StatsForecastAutoARIMA | 917617.19 | 858.57 | 18.91 |
| | Naive Mean | 816762.31 | 764.73 | 16.12 |
| | Naive Seasonal | 1051110.81 | 982.25 | 22.19 |
| | Naive Drift | 812352.21 | 759.07 | 16.37 |
| | Naive Moving Average | 1032574.82 | 962.72 | 22.14 |
| | N-Beats | 653104.31 | 743.54 | 15.96 |
| | DeepAR | 243831.14 | 4897.85 | 94.89 |
| | Prophet | 365241.98 | 891.70 | 34.65 |
| | LLMTime with GPT-3.5-Turbo | 526903.08 | 574.58 | 12.00 |
| | LLMTime with GPT-4-Turbo | 942987.19 | 871.64 | 18.55 |
| | LLMTime with Gemini-1.0-pro | 64.92 | 6.39 | 7.04 |
| | LLMTime with Llama-2 | 675062.52 | 736.04 | 15.83 |
| HeartRateDataset | Exponential Smoothing | 11.16 | 1.38 | 1.49 |
| | SARIMA | 12.98 | 1.34 | 1.61 |
| | Cyclical Regression | 13.58 | 1.31 | 1.20 |
| | AutoARIMA | 13.26 | 1.25 | 1.39 |
| | FFT | 13.95 | 1.16 | 1.34 |
| | StatsForecastAutoARIMA | 10.53 | 1.27 | 1.39 |
| | Naive Mean | 12.02 | 1.27 | 1.26 |
| | Naive Seasonal | 10.55 | 1.32 | 1.31 |
| | Naive Drift | 10.60 | 1.15 | 1.30 |
| | Naive Moving Average | 12.13 | 1.27 | 1.34 |
| | N-Beats | 72.11 | 7.10 | 7.40 |
| | DeepAR | 286.82 | 15.67 | 16.36 |
| | Prophet | 88.93 | 10.97 | 6.54 |
| | LLMTime with GPT-3.5-Turbo | 76.83 | 7.15 | 7.42 |
| | LLMTime with GPT-4-Turbo | 988.14 | 26.57 | 29.22 |
| | LLMTime with Gemini-1.0-pro | 57.96 | 6.01 | 6.66 |
| | LLMTime with Llama-2 | 75.58 | 7.11 | 7.94 |
| Weather | Exponential Smoothing | 1684.38 | 31.60 | 6.79 |
| | SARIMA | 1943.81 | 33.33 | 7.09 |
| | Cyclical Regression | 1700.73 | 30.77 | 7.15 |
| | AutoARIMA | 7315.10 | 57.44 | 11.70 |
| | FFT | 2752.02 | 38.90 | 8.87 |
| | StatsForecastAutoARIMA | 2479.55 | 38.06 | 8.16 |
| | Naive Mean | 39879.84 | 168.27 | 36.44 |
| | Naive Seasonal | 5057.47 | 52.81 | 11.89 |
| | Naive Drift | 5466.23 | 60.58 | 14.70 |
| | Naive Moving Average | 5057.47 | 52.81 | 11.89 |
| | N-Beats | 4532.84 | 39.21 | 23.49 |
| | DeepAR | 6325.75 | 35.97 | 16.59 |
| | Prophet | 3768.15 | 29.36 | 24.01 |
| | LLMTime with GPT-3.5-Turbo | 224.54 | 3.07 | 0.83 |
| | LLMTime with GPT-4-Turbo | 111.65 | 2.40 | 0.64 |
| | LLMTime with Gemini-1.0-pro | 176.32 | 3.72 | 0.75 |
| | LLMTime with Llama-2 | 215.39 | 4.07 | 1.31 |
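For readers who want to reproduce the three error columns, the following is a minimal sketch of the standard definitions of MSE, MAE, and MAPE. It assumes that MAPE is reported as a percentage, which the tables do not state explicitly. The function names and the toy arrays are hypothetical and are not drawn from the experiments.

```python
import numpy as np

def mse(actual, forecast):
    """Mean squared error."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean((forecast - actual) ** 2))

def mae(actual, forecast):
    """Mean absolute error."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(forecast - actual)))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumed unit).

    Undefined when any actual value is exactly zero.
    """
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((forecast - actual) / actual)) * 100.0)

# Two toy cases with identical absolute errors (10, 20, 0).
actual_a, forecast_a = [100.0, 200.0, 400.0], [110.0, 180.0, 400.0]
actual_b, forecast_b = [1.0, 200.0, 400.0], [11.0, 180.0, 400.0]

print(mse(actual_a, forecast_a), mae(actual_a, forecast_a), mape(actual_a, forecast_a))
print(mse(actual_b, forecast_b), mae(actual_b, forecast_b), mape(actual_b, forecast_b))
```

In the two toy cases MSE and MAE are identical, while MAPE differs by a large factor because one actual value is close to zero. This property may be worth keeping in mind when reading MAPE values well above 100, such as several of the Sunspots entries, although the tables alone do not establish that this is the cause.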