JP3771682B2

JP3771682B2 - Vector processing equipment

Info

Publication number: JP3771682B2
Application number: JP21721097A
Authority: JP
Inventors: 貴司持山; 武史曽我
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1997-08-12
Filing date: 1997-08-12
Publication date: 2006-04-26
Anticipated expiration: 2017-08-12
Also published as: JPH1166046A

Description

[0001]
[Technical field of the invention]
The present invention relates to a vector processing apparatus, and in particular aims to speed up vector operations by reducing the number of vector load instructions executed from main memory to the vector registers.
[0002]
[Prior art]
When a vector processing apparatus performs a vector operation, the data in the main memory 100 must be vector-loaded into the vector registers VR1 and VR2 of the vector register unit 110 of the vector unit 101, as shown in FIG. 5. To store the data from the main memory 100 in the vector registers VR1 and VR2, a vector load instruction is executed using the vector access pipe 111. Conventionally, two vector load instructions were executed even when the arrays loaded into the vector register unit 110 differed in address by only one word.
[0003]
For example, when the following DO loop 10 is executed,
DO 10 I = 1, 100
A(I) = B(I) + B(I+1)
10 CONTINUE
the main memory had to be accessed twice, even though both operands come from the same array B and are offset by only one word.
[0004]
The program of the DO loop 10 indicates that the following calculation is performed.
A(1) = B(1) + B(2) ... (1)
A(2) = B(2) + B(3) ... (2)
A(3) = B(3) + B(4) ... (3)
・
・
・
A(100) = B(100) + B(101) ... (100)
When this program is executed, the first vector load instruction stores the first operands on the right-hand side, B(1), B(2), ..., B(100), from the main memory 100 into the vector register VR1 of the vector register unit 110. The second vector load instruction then stores the second operands, B(2), B(3), ..., B(101), from the main memory 100 into the vector register VR2.
[0005]
Then, the vector arithmetic unit 112 performs the calculation of each of the above expressions, and sequentially stores the calculation results in the vector register VR3.
In FIG. 5, reference numeral 102 denotes a scalar unit, and 113 denotes a vector instruction control unit.
[0006]
[Problems to be solved by the invention]
When a vector operation is performed as described above, the vector access pipe 111 must access the main memory twice, even when the elements B(2), B(3), ..., B(101) stored in the vector register VR2 are offset by only one word on the main memory 100 from the elements B(1), B(2), ..., B(100) stored in the vector register VR1. This slows the operation and degrades performance.
[0007]
That is, when the DO loop 10 is executed, the vector instruction control unit 113 issues the first vector load instruction to the vector access pipe 111. The vector access pipe 111 then reads the elements B(I) (I = 1 to 100) of the array B from the main memory 100 and stores them in the vector register VR1. The 100 elements B(1) to B(100) are stored in VR1(1) to VR1(100).
[0008]
Next, the second vector load instruction is issued, and the vector access pipe 111 reads B(I+1) and stores it in the vector register VR2. The 100 elements B(2) to B(101) are stored in VR2(1) to VR2(100), respectively.
[0009]
Subsequently, the vector registers VR1 and VR2 are read, the vector arithmetic unit 112 performs the vector addition, and the result is stored in the vector register VR3. The result held in VR3 is then stored in the array A(I) on the main memory 100, which completes the processing of this DO loop.
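The conventional sequence described above can be sketched in software as follows. This is only an illustrative model, not the apparatus itself: the class `MainMemory`, its counters, and the function `conventional_loop` are hypothetical names introduced here, and the array B is assumed to hold the values 1 to 101.

```python
class MainMemory:
    """Toy main memory that counts vector loads (hypothetical helper)."""

    def __init__(self, words):
        self.words = list(words)
        self.vector_loads = 0    # vector load instructions issued
        self.elements_read = 0   # elements fetched from main memory

    def vector_load(self, start, length):
        """Return `length` consecutive words from 0-based address `start`."""
        self.vector_loads += 1
        self.elements_read += length
        return self.words[start:start + length]


def conventional_loop(mem, n=100):
    """A(I) = B(I) + B(I+1) for I = 1..n, using two vector loads."""
    vr1 = mem.vector_load(0, n)   # first load:  B(1)..B(n)
    vr2 = mem.vector_load(1, n)   # second load: B(2)..B(n+1)
    vr3 = [x + y for x, y in zip(vr1, vr2)]   # vector addition into VR3
    return vr3


mem = MainMemory(range(1, 102))   # assumed contents: B(I) = I
a = conventional_loop(mem)
print(mem.vector_loads, mem.elements_read)   # 2 200
```

In this model the two loads fetch 200 elements in total, although only 101 distinct words of B are needed.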
[0010]
In general, vector registers are far from main memory and the performance of vector access pipes is low. For this reason, executing the vector load twice causes a large overhead and causes a decrease in performance.
[0011]
Accordingly, an object of the present invention is to provide a vector processing apparatus that reduces the number of accesses and improves the calculation speed.
[0012]
[Means for Solving the Problems]
The configuration of a related art of the present invention for achieving the above object is shown in FIG. 1. In FIG. 1, 1 is a main memory, 2 is a vector unit, 3 is a scalar unit, 10 is a vector register unit, 11 is a vector access pipe, 12 is a vector arithmetic unit, and 13 is a vector instruction control unit. As shown in FIG. 1, the vector instruction control unit 13 can, for example, slide the elements stored in the register numbered VR1 into the register numbered VR2. The object of the present invention is achieved by the configurations described in (1) to (4) below.
[0013]
(1) A vector processing apparatus of the present invention comprises main storage means, vector register means, and vector operation means, and reads the elements held in a vector register, shifts them in the element-number direction, and stores them in different element-number positions of a register other than that vector register. In this apparatus, the vector register means is composed of a plurality of blocks; each block operates independently in ordinary vector operations; a dedicated data bus is provided between the blocks for inter-element operations such as vector summation, and elements are transferred between blocks over this dedicated data bus; and vector slide means is provided that executes the vector slide independently of, and in parallel with, the vector access pipe, so that the vector slide processing is executed independently of, and in parallel with, the vector arithmetic unit.
[0014]
(2) In the vector processing apparatus of the present invention, in (1), the elements are stored in a direction in which the element number increases or decreases by one.
[0015]
(3) A vector processing apparatus of the present invention comprises main storage means, vector register means, vector mask register means, and vector operation means. The vector register means is composed of a plurality of blocks; each block operates independently in ordinary vector operations; and a dedicated data bus is provided between the blocks for inter-element operations such as vector summation, with elements transferred between blocks over this dedicated data bus. The apparatus further comprises vector control means which, when the elements held in a vector register are read, shifted in the element-number direction, and stored in different element-number positions of another vector register, merges the values of the source vector register with the original values of the destination vector register, storing only those elements for which the corresponding bit of the vector mask register, after the movement in the element-number direction, has one specific value, either "1" or "0".
[0016]
(4) A vector processing apparatus of the present invention comprises main storage means, vector register means, and vector operation means, and reads the elements held in a vector register, shifts them in the element-number direction, and stores them in different element-number positions of a register other than that vector register. In this apparatus, the vector register means is composed of a plurality of blocks; each block operates independently in ordinary vector operations; a dedicated data bus is provided between the blocks for inter-element operations such as vector summation, and elements are transferred between blocks over this dedicated data bus; and vector slide means is provided that executes the vector slide independently of the vector access pipe and in parallel with the vector arithmetic unit, so that the vector slide processing is executed independently of, and in parallel with, the vector arithmetic unit.
[0017]
This provides the following operational effects.
(1) By performing vector slide processing, in which the elements held in a vector register are read, moved in the element-number direction, and stored in different element-number positions of a vector register, the number of vector load instructions is reduced from the conventional two to one, and high-speed vector processing can be realized.
[0018]
(2) Even when a mask register is used, the number of vector load instructions is reduced, and high-speed vector processing can be realized.
[0019]
(3) Since the slide processing stores the elements in the direction in which the element number increases or decreases by one, arrays offset by one word can be obtained with a single vector load instruction, which reduces the number of vector load instructions and enables high-speed vector processing. In addition, the hardware mechanism for controlling the slide processing is simple, and the control itself is easy.
[0020]
(4) Since the vector slide can be executed without using the vector access pipe, independently of and in parallel with it, the slide processing can be performed while memory accesses are in progress, enabling high-speed vector processing.
[0021]
(5) Even when the vector register means is composed of a plurality of blocks, elements can be transferred over the dedicated data bus that is already provided, so no special pins for the transfer are needed. The number of vector load instructions is reduced and high-speed vector processing is possible.
[0022]
(6) Since dedicated vector slide means is provided, the slide processing can be executed independently of, and in parallel with, the vector operation means. The slide processing and the vector operation can therefore be performed at the same time, enabling high-speed vector processing.
[0023]
[Embodiments of the invention]
A related art of the present invention will be described with reference to FIG. 1. In FIG. 1, 1 is a main memory, 2 is a vector unit, 3 is a scalar unit, 10 is a vector register unit, 11 is a vector access pipe, 12 is a vector arithmetic unit, and 13 is a vector instruction control unit.
[0024]
The main memory 1 stores the various data with which the vector processing apparatus operates: the elements of the arrays to be subjected to vector operations, the operation results, the instructions to be executed by the vector unit 2, and so on.
[0025]
The vector unit 2 performs vector operations on the data stored in the main memory 1 and stores the results in the main memory 1. It includes the vector register unit 10, the vector access pipe 11, the vector arithmetic unit 12, the vector instruction control unit 13, and the like.
[0026]
The scalar unit 3 reads an instruction from the main memory 1 and decodes it to determine whether it is a scalar instruction or a vector instruction. A scalar instruction is executed by the scalar unit 3 itself; a vector instruction is sent to the vector instruction control unit 13.
[0027]
The vector register unit 10 temporarily holds the array elements to be operated on by the vector arithmetic unit 12 and the array elements obtained as results of vector operations. It includes, among others, a vector register VR1 in which the elements B(1), B(2), ..., B(101) of the array B read from the main memory 1 are stored; a vector register VR2 into which part of the elements of VR1, namely B(2), B(3), ..., B(101), are stored by sliding; and a vector register VR3 in which the results of the vector operation on the elements of VR1 and VR2 are stored.
[0028]
The vector access pipe 11 loads vector elements from, and stores them to, the main memory 1. For a load, it calculates the access addresses, issues a load request to the main memory 1, and sends the elements read from the main memory 1 to the vector register unit 10. For a store, it calculates the access addresses, issues a store request to the main memory 1, and sends the elements to be stored to the main memory 1.
[0029]
The vector arithmetic unit 12 reads the elements with the same element number from the vector registers VR1 and VR2, operates on them, and stores the result at the same element number of the vector register VR3. It also performs the slide processing that characterizes the present invention.
[0030]
In this slide processing, the elements B(1), B(2), ..., B(101) are read from the vector register VR1 and stored in the vector register VR2 with each element number decreased by one. That is, the element B(2) at element number 2 of VR1 is stored at element number 2 - 1 = 1 of VR2, the element B(3) at element number 3 of VR1 is stored at element number 2 of VR2, and the element B(101) at element number 101 of VR1 is stored at element number 100 of VR2. As a result, as indicated by the arrows in FIG. 1, the elements of VR1 are stored in VR2 with their element numbers shifted down by one.
[0031]
In the example shown in FIG. 1, the elements of the vector register VR1 are stored in the vector register VR2 with their element numbers decreased by one, but the slide processing is not limited to this. The slide amount is not limited to 1 and may be any integer of 2 or more, and the slide may be in the direction of increasing element numbers as well as decreasing.
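A slide with an arbitrary amount and direction, as described above, might be modeled as in the following sketch. The function name `vector_slide` and the sign convention for `amount` are assumptions made for illustration. The text does not say what happens to destination positions that receive no element, so the sketch assumes they keep their previous contents.

```python
def vector_slide(src, dst, amount):
    """Return a copy of dst after sliding src into it.

    Element i of src (0-based) lands at position i - amount, so
    amount > 0 moves elements toward lower element numbers and
    amount < 0 toward higher ones. Elements whose destination falls
    outside dst are discarded. Positions of dst that receive nothing
    keep their previous contents (an assumption of this sketch).
    """
    out = list(dst)
    for i, value in enumerate(src):
        j = i - amount
        if 0 <= j < len(out):
            out[j] = value
    return out


vr1 = ["B1", "B2", "B3", "B4"]
empty = [None] * 4
print(vector_slide(vr1, empty, 1))    # ['B2', 'B3', 'B4', None]
print(vector_slide(vr1, empty, -2))   # [None, None, 'B1', 'B2']
```

With `amount = 1` this reproduces the case of FIG. 1, where VR2(I) receives VR1(I+1).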
[0032]
When the vector instruction control unit 13 receives a vector instruction from the scalar unit 3, it decodes the instruction and, according to its content, selectively issues control instructions to the vector register unit 10, the vector access pipe 11, the vector arithmetic unit 12, and so on.
[0033]
Next, the operation of the related technique of the present invention shown in FIG. 1 will be described in the case where the instruction of the DO loop 10 is executed.
[0034]
A compiler (not shown) examines the whole of the DO loop 10 and recognizes that it can be executed as vector instructions; that the elements needed for the operation are B(1) to B(101) of the array B; that expressions (1) to (100) are computed from these elements; and that the elements to be stored in the vector register VR2 can be obtained by sliding the elements stored in the vector register VR1 by one in the decreasing direction. It then generates the instructions that carry this out, namely the instructions for steps 1 to 3 described below, and stores them in the main memory 1.
[0035]
(Step 1) The elements B(I), I = 1 to 101, are fetched from the main memory 1 and stored in the vector register VR1.
[0036]
(Step 2) The elements B(1) to B(101) are read from the vector register VR1 and stored in the vector register VR2 with their element numbers decreased by one. Thus B(2) is stored at element number 1 of VR2, B(3) at element number 2, and in general B(I+1) at element number I, up to B(101) at element number 100. This gives the same result as loading the elements B(2) to B(101) from the main memory 1 into VR2.
[0037]
(Step 3) The vector registers VR1 and VR2 are vector-added (I = 1 to 100), and the calculation result is stored in the vector register VR3.
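Steps 1 to 3 can be traced with a short sketch. The function `slide_loop` and its load counter are hypothetical, and Python lists stand in for the vector registers. It illustrates the data flow only and says nothing about timing.

```python
def slide_loop(b, n=100):
    """Compute A(I) = B(I) + B(I+1), I = 1..n, with a single vector load.

    `b` models array B in main memory and must hold at least n + 1 values.
    Returns the result vector and the number of vector loads issued.
    """
    loads = 0

    # Step 1: one vector load brings B(1)..B(n+1) into VR1.
    vr1 = list(b[:n + 1])
    loads += 1

    # Step 2: slide VR1 by one toward lower element numbers into VR2,
    # so VR2(I) = VR1(I+1). B(1) has no destination and is dropped.
    vr2 = vr1[1:]

    # Step 3: vector addition over I = 1..n, result into VR3.
    vr3 = [vr1[i] + vr2[i] for i in range(n)]
    return vr3, loads


b = list(range(1, 102))          # assumed contents: B(I) = I
a, loads = slide_loop(b)
print(loads, a[0], a[-1])        # 1 3 201
```

The result matches expressions (1) to (100) while main memory is read only once, for 101 elements.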
[0038]
The scalar unit 3 then fetches the instructions created by the compiler from the main memory 1, and when they are decoded as vector instructions, they are sent to the vector instruction control unit 13.
[0039]
First, the vector load instruction for step 1 is sent from the vector instruction control unit 13 to the vector access pipe 11, and the vector access pipe 11 reads the elements B(I) (I = 1 to 101) of the array B from the main memory 1 and stores them in the vector register VR1. The 101 elements B(1) to B(101) are stored in VR1(1) to VR1(101).
[0040]
Next, the vector instruction control unit 13 sends the vector slide instruction for step 2 to the vector arithmetic unit 12. The vector arithmetic unit 12 then reads the elements B(1) to B(101) from the vector register VR1 and stores them in the vector register VR2 with their element numbers decreased by one.
[0041]
As a result, the element B(2) stored at element number 2 of the vector register VR1 is stored at element number 1 of the vector register VR2, the element B(3) stored at element number 3 of VR1 is stored at element number 2 of VR2, and the element B(101) stored at element number 101 of VR1 is stored at element number 100 of VR2. In other words, the element B(I) stored at element number I of VR1 is stored at element number (I-1) of VR2. Each datum in VR1 is thus shifted by one element position toward the upper end, that is, toward lower element numbers, and stored in VR2.
[0042]
In this way, as shown in FIG. 1, the elements B(2) to B(101) of VR1 are slid into VR2. The element B(1) stored at element number 1 of VR1 has no slide destination in VR2, so when it is read from VR1 it is discarded. Since B(1) still remains in VR1, however, the operation is not affected at all.
[0043]
Subsequently, the instruction for step 3 is sent from the vector instruction control unit 13 to the vector arithmetic unit 12. The vector registers VR1 and VR2 are then read, the vector addition is executed, and the result is stored in the vector register VR3.
[0044]
The vector instruction control unit 13 then sends an instruction to store the operation result to the vector access pipe 11. The vector access pipe 11 stores the result held in the vector register VR3 in the array A(I) on the main memory 1, which completes the processing of the DO loop 10.
[0045]
Thus, whereas the conventional example described with FIG. 5 issues the vector load instruction twice, the related art of the present invention needs the time-consuming vector load instruction only once. Because one vector load is eliminated, the overhead is small and high performance can be achieved.
[0046]
A first embodiment of the present invention will be described with reference to FIG. 2. In FIG. 2, the same reference numerals as in FIG. 1 denote the same parts, and 14 denotes a vector slide unit.
[0047]
The vector slide unit 14 performs the vector slide processing described in step 2 and elsewhere, according to instructions from the vector instruction control unit 13. The vector slide unit 14 can execute the vector slide processing independently of, and in parallel with, the vector arithmetic unit 12.
[0048]
Accordingly, the vector slide can be executed in parallel with operations in the vector arithmetic unit 12, so vector operations can be performed with high performance.
[0049]
A second embodiment of the present invention will be described with reference to FIG. 3.
A vector processing apparatus is generally composed of a plurality of vector register units and a plurality of vector arithmetic units. In FIG. 3, the same symbols as in the other figures denote the same parts, and the vector processing apparatus is composed of a plurality of vector register and arithmetic unit blocks (hereinafter, blocks) 21, 22, 23, and 24. These blocks all have the same configuration and, as shown representatively for block 22, each includes a vector register unit 10 and a vector arithmetic unit 12.
[0050]
As shown in FIG. 1, the vector register unit 10 includes a plurality of vector registers VR1, VR2, and so on. The vector arithmetic unit 12 includes a vector adder 12-1, a vector multiplier 12-2, a vector divider 12-3, a summation unit 12-4 that obtains the sum of the elements of the vector registers of all blocks, a slide mechanism 12-5 that performs the slide processing described above, and the like.
[0051]
As exemplified by block 22, the vector register unit 10 and the vector arithmetic unit 12 are connected within each block. In ordinary vector operations the blocks can operate independently and in parallel, and data normally does not move between blocks.
[0052]
A vector processing apparatus is, however, provided with a special data bus B, called the summation data bus, for moving data between blocks when executing certain special instructions that require operations across the elements of different blocks, such as vector summation.
[0053]
As for the element numbers of a vector register, if they were placed contiguously within a block, for example 1 to 100 in block 21 and 101 to 200 in block 22, only a particular block would be active and the load would concentrate on it. A so-called interleaved arrangement is therefore adopted, in which element numbers 1, 2, 3, and 4 are placed in blocks 21, 22, 23, and 24, respectively.
[0054]
Consequently, when the slide processing is performed, elements move between blocks. Providing a bus dedicated to slide processing is undesirable because it increases the pin count. In the present invention, the vector slide instruction therefore shares the data bus B provided for summation and similar operations, so the hardware for the vector slide instruction can be built without enlarging the data buses between blocks.
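Why a slide causes inter-block traffic under the interleaved arrangement can be seen from a small sketch. It assumes the interleaving continues cyclically (element 5 in block 21, element 6 in block 22, and so on), which the text implies but does not state. The names `block_of` and `cross_block_moves` are hypothetical.

```python
BLOCKS = (21, 22, 23, 24)   # block numbers from FIG. 3


def block_of(element_number):
    """Block holding a 1-based element number under cyclic interleaving."""
    return BLOCKS[(element_number - 1) % len(BLOCKS)]


def cross_block_moves(length, amount):
    """Count slide transfers whose source and destination blocks differ.

    Element `src` moves to element `src - amount`; transfers whose
    destination falls outside 1..length are discarded and not counted.
    """
    moves = 0
    for src in range(1, length + 1):
        dst = src - amount
        if 1 <= dst <= length and block_of(src) != block_of(dst):
            moves += 1
    return moves


print(cross_block_moves(101, 1))   # 100
print(cross_block_moves(101, 4))   # 0
```

Under this assumption, a slide by one moves every surviving element to a neighboring block, which is the traffic carried by the shared data bus B, while a slide by a multiple of the block count stays within each block.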
[0055]
A third embodiment of the present invention will be described with reference to FIG. 4. As described above, the vector slide instruction reads the elements of the vector register VR1 and slides them, for example, in the direction in which the element number decreases by one; that is, the element VR1(I+1) is written into VR2(I). In the third embodiment, a mask register MR3 is provided. The slid value is written to elements whose corresponding mask register value MR3(I) is, for example, “1”, and is not written to elements whose mask value is “0”. Element numbers that are not written therefore retain their old data.
In this way, mask control can be performed during the slide process.
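The masked slide described above can be sketched as follows. This is an illustrative model, not the hardware of the embodiment: the function name is hypothetical, Python lists are 0-based (list index i stands for element number i + 1), and leaving elements without a source unchanged is an assumption.

```python
def masked_slide(vr1, vr2, mr3, amount=1):
    """Return the new contents of VR2 after a masked slide.

    VR1(I + amount) is written into VR2(I) only where MR3(I) is 1;
    elsewhere VR2 keeps its old value.
    """
    result = list(vr2)  # start from the old data so unwritten elements are retained
    for i in range(len(vr2)):
        src = i + amount
        # Assumption: elements with no source in VR1 are left unchanged.
        if mr3[i] == 1 and 0 <= src < len(vr1):
            result[i] = vr1[src]
    return result


vr1 = [10, 20, 30, 40, 50]
vr2 = [0, 0, 0, 0]
mr3 = [1, 0, 1, 0]
new_vr2 = masked_slide(vr1, vr2, mr3)
```

Here the first and third elements of VR2 receive the slid values from VR1, while the second and fourth keep their old contents.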
[0056]
A fourth embodiment of the present invention will be described. In the fourth embodiment, the vector slide processing does not use the vector access pipe and is executed in parallel with it. Slide processing can thus be performed while the main memory is being accessed, which enables high-speed vector processing.
[0057]
The above description covers the case where the slide is in the direction of decreasing element number. The present invention is not limited to this: the slide may also be in the direction of increasing element number, and the slide amount may be 2 or more.
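A slide with an arbitrary amount and direction can be modelled as below. This is a sketch under stated assumptions: the function name and the signed `amount` convention are hypothetical, lists are 0-based, and destination elements that have no source element are assumed to keep their old value.

```python
def slide(vr1, vr2, amount):
    """Write VR1(I + amount) into VR2(I) for every I that has a source element.

    A positive amount slides toward smaller element numbers
    (VR1(I + 1) -> VR2(I) for amount 1); a negative amount slides
    toward larger element numbers.
    """
    result = list(vr2)
    for i in range(len(vr2)):
        src = i + amount
        if 0 <= src < len(vr1):
            result[i] = vr1[src]
    return result


vr1 = [1, 2, 3, 4, 5]
down_by_two = slide(vr1, [0] * 5, 2)   # element numbers decrease by 2
up_by_one = slide(vr1, [0] * 5, -1)    # element numbers increase by 1
```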
[0058]
[Effects of the invention]
According to the present invention, executing a single vector load instruction yields the data that previously required reading the main memory a plurality of times. The number of vector load instructions, which take a long time to execute, can therefore be reduced compared with the conventional art. As a result, the vector load overhead is reduced and high-speed vector processing can be realized.
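The saving can be illustrated with the loop A(I) = B(I) + B(I + 1) from the conventional example. The following is a toy model, not the claimed hardware: the class and function names are hypothetical, and loading n + 1 elements in one vector load is an assumption made for the sketch.

```python
class MainMemory:
    """Toy main memory that counts vector load instructions (hypothetical model)."""

    def __init__(self, data):
        self.data = list(data)
        self.loads = 0

    def vector_load(self, start, length):
        """Load `length` consecutive words starting at 0-based index `start`."""
        self.loads += 1
        return self.data[start:start + length]


def conventional(mem, n):
    vr1 = mem.vector_load(0, n)      # B(1) .. B(n)
    vr2 = mem.vector_load(1, n)      # B(2) .. B(n+1), second access to main memory
    return [x + y for x, y in zip(vr1, vr2)]


def with_slide(mem, n):
    vr1 = mem.vector_load(0, n + 1)  # B(1) .. B(n+1) in one load (assumed possible)
    vr2 = vr1[1:]                    # vector slide: VR2(I) = VR1(I+1), no memory access
    return [x + y for x, y in zip(vr1, vr2)]


b = list(range(1, 102))              # B(1) .. B(101)
mem_a, mem_b = MainMemory(b), MainMemory(b)
a_conventional = conventional(mem_a, 100)
a_slide = with_slide(mem_b, 100)
```

Both functions produce the same A(1) to A(100), but the slide version issues one vector load where the conventional version issues two.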
[Brief description of the drawings]
FIG. 1 shows a technique related to the present invention.
FIG. 2 shows a first embodiment of the present invention.
FIG. 3 shows a second embodiment of the present invention.
FIG. 4 shows a third embodiment of the present invention.
FIG. 5 shows a conventional example.
[Explanation of symbols]
1 Main memory
2 Vector part
3 Scalar part
10 Vector register part
11 Vector access pipe
12 Vector computing unit
13 Vector instruction control part

Claims

A vector processing apparatus comprising main storage means, vector register means, and vector operation means, in which elements held in a vector register are read out, shifted in the element number direction, and stored in different element number positions of a register different from said vector register,
wherein the vector register means is composed of a plurality of blocks, each block operates independently in normal vector operations, a dedicated data bus is provided between the blocks for operations between elements such as vector summation, and elements are transferred between the blocks using the dedicated data bus,
the vector processing apparatus being characterized by comprising vector slide means for executing vector slide processing in parallel with a vector access pipe and in parallel with a vector computing unit.

The vector processing apparatus according to claim 1, wherein elements are stored in a direction in which the element number increases or decreases by one.

A vector processing apparatus comprising main storage means, vector register means, vector mask register means, and vector operation means,
wherein the vector register means is composed of a plurality of blocks, each block operates independently in normal vector operations, a dedicated data bus is provided between the blocks for operations between elements such as vector summation, and elements are transferred between the blocks using the dedicated data bus,
the vector processing apparatus being characterized by comprising vector control means which, when elements held in a vector register are read out, shifted in the element number direction, and stored in different element number positions of a vector register different from said vector register, merges the values of the read-out vector register with the original values of the destination vector register by storing only those elements for which the corresponding bit of the vector mask register has a specific value of either "1" or "0".

A vector processing apparatus comprising main storage means, vector register means, and vector operation means, in which elements held in a vector register are read out, shifted in the element number direction, and stored in different element number positions of a register different from said vector register,
wherein the vector register means is composed of a plurality of blocks, each block operates independently in normal vector operations, a dedicated data bus is provided between the blocks for operations between elements such as vector summation, and elements are transferred between the blocks using the dedicated data bus,
the vector processing apparatus being characterized by comprising vector slide means for executing vector slide processing independently of the vector access pipe and in parallel with the vector computing unit.