WO2018105373A1

WO2018105373A1 - Information processing device, information processing method, and information processing system

Info

Publication number: WO2018105373A1
Application number: PCT/JP2017/041758
Authority: WO
Inventors: 祐平滝; 真一河野; 邦世大石; 徹哉浅山
Original assignee: ソニー株式会社
Priority date: 2016-12-05
Filing date: 2017-11-21
Publication date: 2018-06-14

Abstract

The present technology relates to an information processing device, information processing method, and information processing system, with which it is possible to carry out a smooth and natural conversation with a hard-of-hearing person. An information processing device according to an aspect of the present technology comprises: a speech acquisition unit which acquires speech information of a first user which has been inputted into a speech input device; and a display control unit which, in a display device for a second user, controls a display of text information which corresponds to the acquired speech information. On the basis of a display quantity of the text information in the display device and/or of an input quantity of the speech information which has been inputted via the speech input device, the display control unit carries out control which relates to the display quantity of the text information in the display device. The present technology may be applied to a conversation assistance device which assists with a conversation by a hard-of-hearing person.

Description

Information processing apparatus, information processing method, and information processing system

The present technology relates to an information processing apparatus, an information processing method, and an information processing system, and more particularly, to an information processing apparatus, an information processing method, and an information processing system that can support natural conversation using voice recognition.

As a voice recognition application program using a smartphone or the like, a technique for converting a user's utterance into text and displaying it on a screen is known. Furthermore, there is a technique for summarizing the text of the speech recognition result (see, for example, Patent Document 1).

WO2014-061388

However, in the case of a speech recognition application program that uses a smartphone or the like, there is a limit to the amount of text that can be displayed. For this reason, there is a need for improvement in providing communication using voice recognition.

This technology has been made in view of these circumstances, and is a technology that supports natural conversation using voice recognition.

The information processing apparatus according to an aspect of the present technology includes the voice acquisition unit that acquires the voice information of the first user input to the voice input device, and the acquired voice in the display device for the second user. A display control unit that controls display of text information corresponding to the information, wherein the display control unit displays the text information on the display device or the input amount of the voice information input from the voice input device Based on at least one of the above, control is performed regarding the display amount of the text information.

An information processing method according to one aspect of the present technology includes: an audio acquisition step of acquiring audio information of a first user input to an audio input device by the information processing device; A display control step for controlling display of text information corresponding to the acquired voice information in a display device for two users, wherein the display control step includes a display amount of the text information on the display device, Alternatively, control regarding the display amount of the text information is performed based on at least one of the input amounts of the speech information input from the speech input device.

A program that is one aspect of the present technology includes a voice input device that acquires voice information of a first user, a display control device that controls display of text information corresponding to the acquired voice information, and the display control device A display device that displays the text information for the second user in accordance with the control from the display device, the display control device being input from the display amount of the text information on the display device or the voice input device Control relating to the display amount of the text information is performed based on at least one of the input amounts of the voice information.

In one aspect of the present technology, the input voice information of the first user is acquired, and display of text information corresponding to the acquired voice information on the display device for the second user is controlled. . In this display control, control relating to the display amount of the text information is performed based on at least one of the display amount of the text information on the display device or the input amount of the speech information input from the speech input device.

According to one aspect of the present technology, natural conversation using voice recognition can be performed.

It is a figure which shows the 1st structural example of the conversation assistance apparatus to which this technique is applied. It is a block diagram which shows the internal structural example of the conversation assistance apparatus to which this technique is applied. It is a functional block diagram of an information processing part. It is a figure which shows the 2nd structural example of the conversation assistance apparatus to which this technique is applied. It is the block diagram which divided the component of the conversation assistance apparatus corresponding to the 2nd structural example. It is a figure which shows the 3rd structural example of the conversation assistance apparatus to which this technique is applied. It is the block diagram which divided the component of the conversation assistance apparatus corresponding to the 3rd structural example. It is a figure which shows the 4th structural example of the conversation assistance apparatus to which this technique is applied. It is the block diagram which divided the component of the conversation assistance apparatus corresponding to the 4th structural example. It is a flowchart explaining a display waiting list production | generation process. It is a flowchart explaining an utterance text display process. It is a figure which shows the example of already-read determination. It is a figure for demonstrating the specific example of a text amount suppression process. It is a figure for demonstrating the specific example of a text amount suppression process. It is a figure for demonstrating the specific example of a text amount suppression process. It is a figure for demonstrating the specific example of a text amount suppression process. It is a figure for demonstrating the specific example of a text amount suppression process. It is a figure for demonstrating the specific example of an edit process (erasing). It is a figure for demonstrating the specific example of an edit process (recurrence story). It is a figure for demonstrating the specific example of an edit process (NG word registration). It is a figure for demonstrating the specific example of an edit process (additional writing). It is a figure for demonstrating the application example of a conversation assistance apparatus. It is a figure for demonstrating the application example of a conversation assistance apparatus. It is a figure for demonstrating the application example of a conversation assistance apparatus. It is a figure for demonstrating the example of the feedback with respect to a speaker. It is a block diagram which shows the structural example of a computer.

Hereinafter, the best mode for carrying out the present technology (hereinafter referred to as an embodiment) will be described in detail with reference to the drawings.

<First configuration example of conversation support device according to an embodiment of the present technology>
FIG. 1 shows a first configuration example of a conversation support apparatus according to an embodiment of the present technology, and shows a case where the conversation support apparatus 10 is formed as one casing.

The conversation support device 10 is for supporting a conversation between a person who does not have anxiety about hearing (hereinafter referred to as user A) and a person who has anxiety about hearing (hereinafter referred to as user B). . It is assumed that the first user in one aspect of the present technology corresponds to the user A in the present configuration example, and the second user in one aspect of the present technology corresponds to the user 2 in the present configuration example. May be. However, the 1st user in one side of this art should just be a user who inputs a voice. That is, the first user (user who inputs voice) is not limited to a single subject (user), and may be a plurality of subjects (users). Similarly, the second user in one aspect of the present technology may be a user who visually recognizes the displayed utterance text, and is not limited to a single subject, and may be a plurality of subjects.

Specifically, the utterance of user A is converted into text (hereinafter referred to as utterance text) by voice recognition processing, and the utterance text is displayed on the display unit 43 for user B. By reading this display, the user B can understand the utterance text (character information) corresponding to the utterance (voice information) of the user A.

The utterance text displayed on the display unit 43 is displayed until the user B finishes reading or a predetermined time elapses.

In determining whether or not the displayed utterance text has been read by the user B, for example, an image of the user B from an image captured by the imaging unit 41 or an utterance of the user B collected by the sound collecting unit 42 is used. Used.

Note that the display unit 43 for the user B is provided with a display unit 22 for the user A (FIG. 2) on the back side, and the display unit 22 has the same display as the display unit 43, that is, the user A The utterance text corresponding to the utterance is displayed. Thereby, the user A can confirm whether or not the user's utterance has been correctly recognized.

<Configuration example of conversation support apparatus according to an embodiment of the present technology>
FIG. 2 is a block diagram illustrating an internal configuration example of the conversation support apparatus according to the embodiment of the present technology.

The conversation support device 10 includes a sound collection unit 21, a display unit 22, an operation input unit 23, an information processing unit 30, an imaging unit 41, a sound collection unit 42, a display unit 43, and an operation input unit 44.

The sound collection unit 21, the display unit 22, and the operation input unit 23 are provided mainly for the user A.

The sound collecting unit 21 collects the voice (utterance) spoken by the user A and supplies the corresponding speech signal to the information processing unit 30. The display unit 22 displays a screen corresponding to the image signal supplied from the information processing unit 30 (for example, an image signal for displaying an utterance text corresponding to the utterance of the user A on the screen). The operation input unit 23 receives various operations from the user A and notifies the information processing unit 30 of operation signals corresponding thereto.

The information processing unit 30 converts the speech signal supplied from the sound collection unit 21 into speech text by speech recognition processing. Further, the information processing unit 30 supplies an image signal for displaying the utterance text on the screen to the display unit 43. Details of the information processing unit 30 will be described later.

The imaging unit 41, the sound collection unit 42, the display unit 43, and the operation input unit 44 are provided mainly for the user B.

The imaging unit 41 images the user B and supplies the moving image signal obtained as a result to the information processing unit 30. The sound collecting unit 42 collects the voice (speech) spoken by the user B and supplies the corresponding speech signal to the information processing unit 30. The display unit 43 displays a screen corresponding to an image signal supplied from the information processing unit 30 for displaying the utterance text corresponding to the utterance of the user A on the screen. The operation input unit 44 receives various operations from the user B and notifies the information processing unit 30 of operation signals corresponding thereto.

<Configuration example of functional block of information processing unit 30>
FIG. 3 shows a configuration example of functional blocks included in the information processing unit 30.

The information processing unit 30 includes a voice recognition unit 31, an image recognition unit 32, a misrecognition learning unit 33, an analysis unit 35, an editing unit 36, an additional writing learning unit 37, a display waiting list holding unit 38, a display control unit 39, and a feedback unit. 40.

The speech recognition unit 31 generates an utterance text by converting an utterance signal corresponding to the utterance of the user A supplied from the sound collection unit 21 into an utterance text by speech recognition processing, and supplies the utterance text to the analysis unit 35.

In addition, the speech recognition unit 31 converts an utterance signal corresponding to the utterance of the user B supplied from the sound collection unit 42 into an utterance text by voice recognition processing, and the utterance text represents a specific keyword indicating the user B's already read. (For example, “Yes”, “Yes”, “Okay”, “OK”, “Next”, etc. registered in advance) are detected, and the detection result is supplied to the display control unit 39.

Based on the moving image signal supplied from the imaging unit 41, the image recognition unit 32 performs a specific operation indicating the user B's reading (for example, nodding, watching the screen and then looking in a direction other than the screen). It detects and supplies a detection result to the display control part 39. FIG. Further, the image recognition unit 32 measures the distance between the user B and the display unit 43 based on the moving image signal supplied from the imaging unit 41 and notifies the display control unit 39 of the measurement result. The distance between the user B and the display unit 43 is used to set the character size of the utterance text displayed on the display unit 43. For example, the longer the distance between the user B and the display unit 43, the larger the character size is set.

Note that when a wearable device such as a head-mounted display described later is used, the line-of-sight direction may be determined based on the direction of the wearable device, that is, the direction of the user B's head or body. The direction of the wearable device can be determined based on position information acquired from a camera, an acceleration sensor, a gyro sensor, or the like provided in the wearable device. Further, the Purkinje image of the eyeball of the user B and the pupil center may be determined using an infrared camera and an infrared LED, and the line-of-sight direction of the user B may be determined based on these.

The misrecognition learning unit 33 edits input from the user A or the user B to the utterance text corresponding to the utterance of the user A that is the result of the speech recognition process (for example, an erase instruction operation, a recurrent speech instruction operation, NG Corresponding to the word registration instruction operation), misrecognized words included in the utterance text are registered in the misrecognition list 34. In addition, when the utterance text corresponding to the utterance of the user A, which is the result of the speech recognition process, includes a word registered in the misrecognition list 34, the misrecognition learning unit 33 A recognition result (second candidate or the like) other than a misrecognized word (first candidate of recognition result) is requested.

The analysis unit 35 analyzes the speech text corresponding to the speech of the user A generated by the speech recognition unit 31, for example, by decomposing the speech text into parts of speech or extracting keywords.

Based on the analysis result of the analysis unit 35, the editing unit 36 controls the amount of text that specifies particles or the like that do not impair the meaning of the utterance text even if line breaks or page breaks are added or deleted as appropriate to the utterance text. An edit process such as a process is performed and supplied to the display wait list holding unit 38. In the editing process, it may be considered that at least one of line feed, page break or text amount suppression processing is performed, and at least one of line feed, page break or text amount suppression processing may be omitted.

The editing unit 36 can supply a plurality of related utterance texts to a display waiting list holding unit 38 in a thread. In this case, an icon corresponding to a thread waiting for display may be displayed while displaying the current thread. Display objects indicating threads waiting to be displayed are not limited to icons, and may be appropriately set. According to such a configuration, it is possible to easily grasp how much the user B has finished reading the other party's utterance text. Moreover, according to such a structure, the user B can act to suppress the input amount of the user A based on the progress of the utterance text.

Further, the editing unit 36 outputs a sentence of the utterance text based on the editing operation input by the user A using the operation input unit 23 with respect to the utterance text corresponding to the utterance of the user A displayed on the display unit 22. Controls the process of deleting, inserting utterance text corresponding to recurrent utterances, and registering NG words. In addition, the editing unit 36 performs an additional operation (specifically, “?”) That is input by the user A using the operation input unit 23 to the utterance text corresponding to the utterance of the user A displayed on the display unit 22. The operation of adding a symbol such as “?” To the utterance text is controlled based on the operation of adding a symbol such as “(question mark)”. Note that symbols, pictograms, emoticons, etc. other than “?” May be additionally recorded.

The editing unit 36 is based on an editing operation or an additional writing operation input by the user B using the operation input unit 44 with respect to the utterance text corresponding to the utterance of the user A displayed on the display unit 43. Editing processing can be performed. In other words, both the user A and the user B can perform an editing operation and an additional writing operation on the displayed utterance text corresponding to the utterance of the user A.

The additional writing learning unit 37 learns the additional writing operation input by the user A or the user B, and based on the learning result, even if there is no additional writing operation from the user A or the user B, the same symbol or the like is given to the same utterance text. The editing unit 36 is controlled so as to be additionally written.

For example, when an append operation for instructing the addition of “?” Is learned for the utterance text “Drug has been drunk” corresponding to the utterance of user A, Even if there is no additional operation from A or user B, the editing unit 36 is controlled to add “?” And edit “Did you take medicine?”.

The display waiting list holding unit 38 displays the edited utterance text including at least one of line feed, page break, and text amount suppression processing (the text amount suppression processing may not be performed depending on the number of characters). The information is registered in the display wait list in the sequence order, that is, in the order in which the user A speaks. When the utterance text registered in the display waiting list is read from the display control unit 39, it is deleted from the display waiting list.

The display control unit 39 reads out the utterance texts from the display waiting list in chronological order, generates an image signal for displaying the read out utterance text on the screen, and supplies it to the display unit 22 and the display unit 43. Further, the display control unit 39 displays a display amount of the utterance text currently displayed on the display unit 22 and the display unit 43, a detection result of a specific keyword that is supplied from the voice recognition unit 31 and represents the user B already read, The display amount of the utterance text on the display unit 22 and the display unit 43 is controlled based on the detection result of the specific action representing the read of the user B supplied from the image recognition unit 32. Further, the display control unit 39 sets the character size for displaying the utterance text according to the distance between the user B and the display unit 43.

The feedback control unit 40 is registered in the utterance speed of the user A, the length of the utterance of the user A, the amount of speech recognition characters per unit time, the amount of utterance text displayed on the display unit 43, and the display waiting list. Corresponding to the amount of utterance text, whether or not the user B has already read, the reading speed of the user B, etc., the utterance speed is increased (or decreased) for the user A who is the utterer by using character display or voice output. To control the feedback to notify the user, to notify the utterance, or to prompt the next utterance. Further, the feedback control unit 40 corresponds to the amount of utterance text displayed on the display unit 43, the amount of utterance text registered in the display waiting list, whether or not the user B has been read, the reading speed of the user B, etc. Then, feedback that prompts the user B to read the utterance text is controlled by using a character display or the like.

Note that the above-described functional blocks included in the information processing unit 30 do not have to be housed in the same casing, and may be arranged in a distributed manner. Some or all of these functional blocks may be arranged on a server on the Internet, a so-called cloud network.

<Second configuration example of conversation support device according to an embodiment of the present technology>
FIG. 4 illustrates a second configuration example of the conversation support apparatus according to the embodiment of the present technology. In the second configuration example, the conversation support device 10 is configured as a system including a plurality of different electronic devices. In this case, the connection between the plurality of electronic devices constituting the conversation support apparatus 10 may be wired connection or may use predetermined wireless communication (for example, Bluetooth (registered trademark), Wi-Fi (trademark), etc.). Good.

In the second configuration example, the conversation support device 10 includes a smartphone 50 used by the user A and a tablet PC (hereinafter referred to as a tablet) 60 used by the user B.

FIG. 5 shows a state in which the constituent elements of the conversation support apparatus 10 shown in FIG. 2 are divided into the smartphone 50 and the tablet PC 60.

That is, among the components of the conversation support device 10, the sound collection unit 21, the display unit 22, the operation input unit 23, and the information processing unit 30 are realized by the smartphone 50. In this case, a microphone, a display, a touch panel, and the like included in the smartphone 50 correspond to the sound collection unit 21 and the operation input unit 23, respectively. An application program executed by the smartphone 50 corresponds to the information processing unit 30.

Of the components of the conversation support device 10, the imaging unit 41, the sound collection unit 42, the display unit 43, and the operation input unit 44 are realized by the tablet 60. In this case, the camera, microphone, display, touch panel, and the like included in the tablet 60 correspond to the imaging unit 41, the sound collection unit 42, the display unit 43, and the operation input unit 44, respectively.

However, in the case of FIG. 5, the speech recognition unit 31 among the functional blocks of the information processing unit 30 is arranged in a server 72 that can be connected via the Internet 71.

<Third configuration example of conversation support device according to an embodiment of the present technology>
FIG. 6 illustrates a third configuration example of the conversation support apparatus according to the embodiment of the present technology. In the third configuration example, the conversation support device 10 is configured as a system including a plurality of electronic devices.

That is, in the third configuration example, the projector 80 that projects the video for displaying the utterance text on the smartphone 50 used by the user A and the position where the user B lying on the bed can see, for example, the wall or ceiling of the room. And a camera 110 arranged on the ceiling or the like.

FIG. 7 shows a state in which the constituent elements of the conversation support apparatus 10 shown in FIG. 2 are divided into a smartphone 50, a projector 80, and a camera 110.

That is, among the components of the conversation support device 10, the sound collection unit 21, the display unit 22, the operation input unit 23, and the information processing unit 30 are realized by the smartphone 50.

Of the components of the conversation support device 10, the imaging unit 41 and the sound collection unit 42 are realized by the camera 110. In this case, the image sensor and the microphone included in the camera 110 correspond to the imaging unit 41 and the sound collecting unit 42, respectively.

Among the components of the conversation support apparatus 10, the display unit 43 and the operation input unit 44 are realized by the projector 80. In this case, the projection unit and the remote controller included in the projector 80 correspond to the display unit 43 and the operation input unit 44, respectively.

Also in the case of FIG. 7, the voice recognition unit 31 among the functional blocks included in the information processing unit 30 is arranged in a server 72 that can be connected via the Internet 71.

<Fourth configuration example of the conversation support device according to the embodiment of the present technology>
FIG. 8 illustrates a fourth configuration example of the conversation support apparatus according to the embodiment of the present technology. In the fourth configuration example, the conversation support apparatus 10 is configured as a system including a plurality of different electronic devices.

That is, the fourth configuration example includes a neck microphone 100 used by the user A, a television receiver (hereinafter referred to as TV) 90 disposed at a position where the user A and the user B can see, The camera 110 is mounted on the TV 90.

FIG. 9 shows a state in which the components of the conversation support device 10 shown in FIG. 2 are divided into a neck microphone 100, a TV 90, and a camera 110.

That is, among the components of the conversation support device 10, the sound collection unit 21 is realized by the neck microphone 100. Note that the neck microphone 100 may be provided with a speaker that outputs sound in addition to the sound collecting unit 21.

Among the components of the conversation support device 10, the imaging unit 41 and the sound collection unit 42 are realized by the camera 110.

Among the components of the conversation support device 10, the display unit 43 and the operation input unit 44 are realized by the TV 90. In this case, the display and remote controller included in the TV 90 correspond to the display unit 43 and the operation input unit 44, respectively. It is assumed that the display and the remote controller included in the TV 90 also serve as the display unit 22 and the operation input unit 23 for the user A.

Also in the case of FIG. 9, the voice recognition unit 31 among the functional blocks of the information processing unit 30 is arranged in a server 72 that can be connected via the Internet 71.

As in the first to fourth configuration examples described above, the conversation support device 10 can be configured as one electronic device, or can be configured as a system in which a plurality of electronic devices are combined. The first to fourth configuration examples described above can be combined as appropriate.

Further, as an electronic device that configures the conversation support device 10 as a system, a wearable device such as a clock-type terminal or a head-mounted display, a monitor for a PC (personal computer), or the like can be employed in addition to the above-described example.

<Operation of Conversation Support Device 10>
Next, the operation of the conversation support apparatus 10 will be described.

FIG. 10 is a flowchart for explaining display wait list generation processing by the conversation support apparatus 10. This display waiting list generation process is repeatedly executed until the power is turned off after the conversation support device 10 is activated.

In step S1, when the user A speaks, the sound is acquired by the sound collecting unit 21. The sound collecting unit 21 converts the voice of the user A into an utterance signal and supplies it to the information processing unit 30. In step S 2, in the information processing unit 30, the speech recognition unit 31 converts the speech signal corresponding to the speech of the user A into speech text by performing speech recognition processing.

In step S3, the analysis unit 35 analyzes the utterance text corresponding to the utterance of the user A generated by the voice recognition unit 31. In step S4, the editing unit 36 performs an editing process including at least one of a line feed, a page break, or a text amount suppression process on the utterance text corresponding to the utterance of the user A based on the analysis result, The processed utterance text is supplied to the display waiting list holding unit 38.

In step S5, the display waiting list holding unit 38 holds the edited utterance texts supplied from the editing unit 36 in chronological order. Thereafter, the process returns to step S1, and the subsequent steps are repeated.

FIG. 11 is a flowchart for explaining utterance text display processing by the conversation support apparatus 10. The utterance text display process is repeatedly executed in parallel with the display wait list generation process described above until the power is turned off after the conversation support device 10 is activated.

In step S11, the display control unit 39 determines whether or not the utterance text is currently displayed on the screens of the

display units

22 and 43. If it is determined that it is displayed, the process proceeds to step S12. In step S12, the display control unit 39 determines whether or not a predetermined shortest display time has elapsed since the display of the currently displayed utterance text has started, and the shortest display time has elapsed. Wait until. If the shortest display time has elapsed, the process proceeds to step S13.

In step S 13, the display control unit 39 represents the detection result of the specific keyword representing the user B's read, which is supplied from the voice recognition unit 31, and the user B's read that is supplied from the image recognition unit 32. Based on the detection result of the specific action, it is determined whether or not the reading of the user B with respect to the displayed utterance text has been detected.

FIG. 12 shows an example of determination of the read detection of the user B in step S13.

For example, when a specific keyword representing a read such as “Yes” is detected from the speech recognition result of the utterance by the user B, it is assumed that the user B understands when a specific keyword indicating the read is detected, and the user B has already read Is determined to have been detected.

In addition, for example, when a specific operation representing a read such as a nod is detected from the image recognition result of a moving image obtained by capturing the user B, the user B can detect when a specific operation indicating the read is detected a predetermined number of times (for example, twice). It is estimated that the user B has been read, and it is determined that the user B has been read.

Further, for example, when a state in which the user B looks at the direction other than the screen after the user B is gazing at the screen (display unit 43) is detected from the image recognition result of the moving image obtained by capturing the user B, this continues for a predetermined time. It is estimated that the user B understands at the time, and it is determined that the user B has been detected.

Further, for example, when it is detected from the voice recognition result of the utterance by the user A that the user A has newly uttered, the conversation is progressing between the user A and the user B when the user A can be detected. It is estimated that B understands, and it is determined that the user B's read has been detected.

In addition, the read determination of the user B is not limited to the above-described example. For example, the user may arbitrarily add a specific keyword indicating read or a specific operation indicating read.

Returning to FIG. 12, if the user B has not been read in step S 13, the process proceeds to step S 14. In step S14, the display control unit 39 determines whether or not a predetermined longest display time has elapsed since the display of the currently displayed utterance text has started, and the longest display time has elapsed. The process returns to step S13 until steps S13 and S14 are repeated. Then, when the user B has been read or the longest display time has elapsed, the process proceeds to step S15.

In step S 15, the display control unit 39 reads the utterance texts from the display waiting list in time series order, generates an image signal for displaying the read utterance texts on the screen, and supplies the image signals to the display unit 22 and the display unit 43. . At this time, if the screens of the display unit 22 and the display unit 43 are already full of utterance text, the screen is scrolled, the utterance text that was displayed first disappears from the screen, and is newly read from the display wait list. The issued utterance text is displayed on the screen.

If it is determined in step S11 that the utterance text is not currently displayed on the screens of the

display units

22 and 43, steps S12 to S14 are skipped, and the process proceeds to step S15.

After this, the process returns to step S11 and the subsequent steps are repeated.

As described above, the display waiting list generation process and the utterance text display process are executed in parallel, so that the user A's utterance is presented to the user B as the utterance text, The display of the utterance text is advanced.

<Specific example of editing processing including at least one of line feed, page break, or text amount suppression processing>
Next, a specific example of editing processing including at least one of line feed, page break, or text amount suppression processing by the editing unit 36 will be described.

FIG. 13 shows a situation where, for example, a user A who is an elementary school student and a user B who is a mother have a conversation using the conversation support device 10. In the figure, it is assumed that the user A uttered without saying at a stretch that he said, “If you went to school yesterday, you were told to collect 10000 yen because you collected money for school trips.”

FIG. 14 shows a display example of the display unit 43 in the situation shown in FIG. 14A shows a state in which the editing process is not reflected, FIG. 14B shows a state in which line breaks and page breaks are reflected in the editing process, and FIG. 14C shows a line break and page break. The page and the text amount suppression processing are all reflected.

As shown in FIG. 13, when the user A occurs at a stroke without dividing the utterance, the display unit 43 initially displays the utterance text that does not reflect the editing process as shown in A of FIG. 14. Is done. In this state, line breaks and page breaks are made regardless of the meaning and context, so it is difficult to read, and the numerical value (10000 yen in the case of the figure) is divided in the middle, so the numerical value can be misunderstood. There is sex.

When the user B performs a first operation (for example, an operation of tapping the screen) with respect to the display of FIG. 14A, line breaks and page breaks in the editing process are reflected, which are shown in FIG. 14B. As described above, line breaks and page breaks are performed according to the meaning and context of the utterance text, making it easier to read and the effect of suppressing misunderstandings such as numerical values.

When user B performs a second operation (for example, an operation of double-tapping the screen) on the display of B in FIG. 14, the text amount suppression process is further reflected, as shown in C in FIG. 14. The text volume of the utterance text is suppressed without impairing the meaning or context. Therefore, in addition to the above-described effects, it is possible to expect an effect that the time required for the user B to read can be shortened.

Note that when the user B performs a third operation (for example, an operation of swiping the screen) on the display of C in FIG. 14, the displayed utterance text may be deleted from the screen.

Further, when the user B performs the first operation with respect to the display of B of FIG. 14, the display of A of FIG. 14 may be returned. Similarly, when the user B performs the second operation on the display shown in FIG. 14C, the display may return to the display shown in FIG. 14B.

Alternatively, when the user B performs the first operation with respect to the display of FIG. 14A, the display of FIG. 14B is displayed, and when the user B performs the first operation, the display of FIG. When the user B performs the first operation, the displayed utterance text may be erased from the screen. Thereafter, each time the user B performs the first operation again, the display may return to the display of C in FIG. 14, B in FIG. 14, or A in FIG. 14.

In the above description, the editing process is reflected in the displayed utterance text in response to the operation by the user B. However, the editing process is performed on the displayed utterance text in accordance with the operation by the user A. It is also possible to reflect. In addition, at least one of the first operation, the second operation, or the third operation may be regarded as a predetermined operation in one aspect of the present technology.

<Other specific examples of editing processing including text amount suppression processing>
Next, another specific example of editing processing including text amount suppression processing will be described.

FIG. 15 shows a situation where user A and user B have a conversation using the conversation support device 10. However, illustration of user B is omitted. In the case of the figure, it is assumed that the user A utters a relatively short sentence such as “Good morning”, “Tomorrow at 10 o'clock at the Shinagawa station”.

FIG. 16 shows a display example on the display 43 of the utterance text corresponding to the utterance of the user A shown in FIG. When the user A utters a sentence with a relatively short utterance, the utterance text corresponding to the sentence is also displayed divided into short parts as shown in FIG. In the case of the figure, the utterance text other than “Good morning” is displayed in a state in which the text amount suppression process for deleting the nouns and the verbs and deleting the particles is reflected. That is, in the text amount suppression process of this example, parts of speech that are less important for understanding the meaning and context of the utterance text are omitted as appropriate. Note that the wording to be omitted is not limited to the part of speech, and may be appropriately set by the user.

Instead of deleting particles that do not impair the meaning or context of the utterance text, the particles may be displayed less prominently than nouns or verbs related to the meaning or context of the utterance text. Good. In other words, the utterance text may be displayed such that nouns, verbs, etc. stand out from particles, etc.

FIG. 17 shows a display example in which the size of the particles such as particles is made smaller than the nouns and verbs related to the meaning and context of the utterance text so that the nouns and verbs stand out.

Although not shown in the figure, the color of particles such as particles is light and the color of characters such as nouns and verbs is displayed darkly, the brightness of characters such as particles is low, and the brightness of characters such as nouns and verbs is reduced. The line may be displayed higher, or the line of a character such as a particle may be thinned, and the line of a character such as a noun or a verb may be displayed thickly.

As described above, if a noun, a verb, or the like that affects the meaning of the utterance text is displayed conspicuously without indicating a particle or the like that does not affect the meaning of the utterance text, the user B can display an insignificant particle, etc. Without reading, you will read prominent nouns and verbs. Therefore, it is possible to shorten the time required for the user B to read without losing the meaning of the utterance text.

<Specific Example of Editing Process by Editing Unit 36>
Next, the editing process corresponding to the button operation by the user for the utterance text displayed on the screen will be described.

FIG. 18 shows a display example when the delete button 111 is provided corresponding to each utterance text displayed on the display unit 22 for the user A. Each utterance text shown in FIG. 18 corresponds to the utterance to user A shown in FIG.

For example, when the user A finds misrecognition in the utterance text that is the voice recognition result of his / her utterance, the utterance text can be deleted by operating the delete button 111.

In the case of the display example in FIG. 18, the word that should be recognized as “Shinagawa” is misrecognized as “Sanagawa”, so that when user A who has found this misrecognition operates the delete button 111, The utterance text including “Jonagawa” is deleted. Then, the misrecognition learning unit 33 learns that the utterance text including “Self” is erased (registered in the misrecognition list 34).

That is, by operating the delete button 111, the user A can delete the misrecognized utterance text or the utterance text corresponding to the wrong utterance.

Note that the delete button 111 can also be provided on the display section 43 for the user B. In this case, the user B can erase the utterance text that has been read, for example, by operating the erase button 111.

When the utterance text is erased by the operation of the erase button 111 by the user B, the fact is notified to the user A side. Thereby, the user A can confirm the read of the user B with respect to the erased utterance text. On the contrary, when the utterance text is erased by the operation of the erase button 111 by the user A, the fact may be notified to the user B side. This notification method may use screen display or audio output.

FIG. 19 shows a display example when a re-utterance button 112 is provided corresponding to each utterance text displayed on the display unit 22 for the user A. Note that each utterance text shown in FIG. 19 corresponds to the utterance to user A shown in FIG.

For example, when the user A finds misrecognition in the utterance text that is the voice recognition result of his / her utterance, the user can re-phrase (speak again) the utterance text by operating the recurrence button 112.

In the case of the display example in FIG. 19, the word that should be recognized as “Shinagawa” is misrecognized as “Sanakawa”, so that the user A who has found this misrecognition operates the recurrence button 112. When you say “Gather to Shinagawa at 10:00 tomorrow” again, the currently displayed “Gather to Jonagawa tomorrow at 10:00” will be the utterance text that is the speech recognition result of the recurrent speech (correctly recognized) In case “Tomorrow we will gather at Shinagawa at 10:00”). Further, it is learned by the misrecognition learning unit 33 that the utterance text including “Sanagawa” has been replaced (registered in the misrecognition list 34).

That is, the user A operates the recurrence button 112 to replace the misrecognized utterance text or the display of the utterance text corresponding to the wrong utterance with the utterance text corresponding to the recurrence at the position. Can do.

Instead of re-speaking the entire utterance text (in this case, “Tomorrow is a set”), select a word (for example, Jonagawa) so that only that word can be re-spoken. Good.

Also, the re-utterance button 112 can be provided on the display section 43 for the user B. In that case, in response to the user B operating the re-utterance button 112, the user A is notified so as to prompt the re-utterance. This notification method may use screen display or audio output.

FIG. 20 shows a display example when an NG word registration button 113 is provided corresponding to each utterance text displayed on the display unit 22 for the user A. In addition, each utterance text shown by FIG. 20 respond | corresponds to the utterance to the user A shown by FIG.

For example, if user A finds misrecognition in the utterance text that is the speech recognition result of his utterance and does not want the misrecognition result to appear again, it is registered as an NG word by operating the NG word registration button 113 can do.

In the case of the display example in FIG. 20, some utterance of user A is misrecognized as “erotic comic” and displayed, and user A who does not want to display this word again uses NG word registration button 113. When the operation is performed, the displayed utterance text “erotic comic” is deleted, and “erotic comic” is registered as an NG word in the misrecognition learning unit 33 (registered in the misrecognition list 34).

That is, by operating the NG word registration button 113, the user A can register a word that is erroneously recognized and is not desired to be displayed again as an NG word.

Note that the NG word registration button 113 can also be provided on the display section 43 for the user B. In that case, the user B can also register a word that he / she does not want to redisplay as an NG word by operating the NG word registration button 113.

FIG. 21 shows a display example when the append button 114 is provided corresponding to each utterance text displayed on the display unit 22 for the user A. Each utterance text shown in FIG. 21 corresponds to the utterance to user A shown in FIG.

For example, if the user A thinks that adding “?” To the utterance text that is the voice recognition result of the utterance of his / her own question or the like will increase the understanding of the user B, by operating the append button 114, “?” Can be added to the utterance text.

The display example of FIG. 21 shows a result of the operation of the add button 114, and “?” Is added to the utterance text “Drinking medicine for lunch today” corresponding to the utterance of the user A. Yes. In this case, the fact that “?” Is added to “Today's lunch is already taken” is registered in the additional writing learning unit 37.

That is, the user A can add “?” To the utterance text by operating the add button 114.

In addition, the append button 114 can be provided on the display unit 43 for the user B. In that case, when the user B does not understand the meaning of the displayed utterance text or wants to know more detailed contents, the user B selects a word or the like included in the displayed utterance text, and then appends it. By operating the button 114, the user A can be inquired about the meaning of a word or the like.

It should be noted that when the append button 114 is operated so that symbols other than “?”, Pictograms, emoticons, etc. can be additionally recorded, the user may select a symbol to be added.

In the above description, the delete button 111, the recurrence button 112, the NG word registration button 113, and the postscript button 114 are displayed individually, but they may be displayed simultaneously.

In addition, instead of displaying each button, a predetermined touch operation (for example, when the operation input unit 23 is a touch panel, a double operation is performed in response to an erase instruction, a recurrent speech instruction, an NG word registration, and an additional write instruction. Operation, long tap operation, flick operation, etc.) may be assigned. Furthermore, instead of displaying each button, a three-dimensional gesture operation performed by the user A or the user B may be assigned to the deletion instruction, the recurrent speech instruction, the NG word registration, and the additional recording instruction. Here, the touch operation may be regarded as a two-dimensional gesture operation. Further, the three-dimensional gesture operation may be performed using a controller included in the acceleration sensor or the gyro sensor, or may be performed using an image recognition result related to the user's operation. In this specification, these touch operations and three-dimensional gesture operations may be simply referred to as “gesture operations”.

Note that when a wearable device, particularly a head-mounted display, is used, a nodding operation of the user B, an operation of shaking the head, and the like can be assigned as the gesture operation. When the gaze detection function is employed in the wearable device, a physical action according to the movement of the gaze of the user B with respect to the displayed utterance text may be learned as a gesture operation. According to such a configuration, it is possible to improve the accuracy of the already-read determination according to the gesture operation.

Furthermore, instead of displaying each button, a predetermined magic word uttered by the user A or the user B may be assigned to the erasure instruction, the recurrent speech instruction, the NG word registration, and the additional writing instruction.

Further, when the user A performs a predetermined gesture assigned to the erasure instruction or utters a predetermined magic word immediately after the utterance, the display of the utterance text corresponding to the utterance can be stopped. Good.

Here, the discontinuation of the display of the utterance text can include the discontinuation of the display of the text in the middle of the analysis, that is, the discontinuation of the display processing of the undisplayed text. When the display of the utterance text is to be stopped, one sentence immediately before the erasure instruction is performed may be collectively erased by analyzing text information. As a result, it is possible to cancel text information (such as after words or fillers) that the user A has unintentionally input. In addition, when the user A utters a predetermined gesture or a predetermined magic word indicating that voice input is not performed before the utterance, the information processing unit 30 immediately follows the predetermined gesture or the predetermined magic word. The display of the voice input that is input to may be prohibited. Thereby, since the user A can arbitrarily select a state in which no utterance is transmitted, the display of an unintended utterance can be suppressed.

<Application example of conversation support device 10>
Next, an application example of the conversation support apparatus 10 will be described.

FIG. 22 shows an example of a usage situation when the conversation support apparatus 10 can be used by three or more people. In the case of the figure, the conversation support device 10 is used to support the conversation between the users A1, A2 and A3 who are not worried about hearing and the user B who is worried about hearing.

Each of the users A1 to A3 has a smartphone 50 for the user A, and the utterance texts corresponding to the utterances collected by the smartphones 50 existing in a predetermined distance range are grouped. Are collectively displayed on the display unit 43.

In addition, as a method of detecting the smartphone 50 existing in a predetermined distance range, for example, each smartphone 50 outputs a predetermined sound wave to each other, and can be realized by collecting and analyzing the sound wave output by other than itself. . Further, for example, the smartphone 50 may be detected from an image obtained by the camera 110 installed on the ceiling, and the position of each smartphone 50 may be specified.

In the display section 43 for the user B, utterance texts corresponding to the utterances of the users A1 to A3 are displayed in time series, and the displayed utterance text is uttered by whom of the users A1 to A3. The speaker mark 121 representing the speaker is displayed in association with each utterance text so that the user B can determine whether it is present.

FIG. 23 shows the direction in which the speaker is in the state where the user B looks at the display unit 43 as another method for indicating who the user A1 to A3 is speaking the displayed utterance text. Is displayed on the screen.

In the case of FIG. 23, since the utterance text corresponding to the utterance of the user A3 on the right side when the user B looks at the display unit 43 is displayed on the screen, the utterance direction instruction mark is displayed on the right side of the screen of the display unit 43. 131 is displayed.

The relative directions of the users A1, A2, and A3 when the user B looks at the display unit 43 can be detected from an image obtained by the camera 110 installed on the ceiling, for example.

FIG. 24 shows a situation in which the user A and the user B facing each other across the table are using the conversation support device 10. In this case, the projector 80 may collectively project the screen of the display unit 22 for the user A and the screen of the display unit 43 for the user B onto the table. At this time, the screen of the display unit 22 for the user A is displayed in a direction that the user A can easily read, and the screen of the display unit 43 for the user B is displayed in a direction that the user B can easily read.

<Feedback for user A who is a speaker>
FIG. 25 shows an example of feedback to the user A who is the speaker among the users who are using the conversation support device 10.

For example, when the display of the utterance text on the display unit 43 is full, the feedback control unit 40 controls the user A who is the speaker, for example, “SlowSdown”, “The screen is full”. , "Please speak slowly", "Please wait", "Please divide once", "Feed unread", etc. feedback to inform you to slow down the speaking rate, text display and voice output using smartphone 50 etc. Is done by.

It should be noted that an indicator corresponding to the utterance speed of the user A and the length of the utterance break may be displayed on the screen, or an alarm sound or the like may be output.

When the user A speaks at an optimum speed or segmentation for voice recognition or screen display, the user A is given points, and the user A performs some service according to the given points. Benefits and rankings may be obtained.

<Other application examples>
In the present embodiment, the conversation support device 10 is used for the purpose of supporting the conversation between the user A who is not anxious about hearing and the user B who is uneasy about hearing. For example, the present invention can be applied to applications that support conversations between people using different languages. In that case, a translation process may be performed after the voice recognition process.

Alternatively, the conversation support device 10 may capture the mouth when the user A speaks as a moving image, display the utterance text, and display the moving image of the user A's mouth. In this case, the display of the utterance text and the motion of the moving image of the user A's mouth may be displayed in synchronization. In this case, the conversation support device 10 can be used for learning lip reading, for example.

Also, the conversation support device 10 may record the utterance of the user A and store the utterance text that is the voice recognition result in association with the utterance text so that the saved result can be reproduced and displayed later.

Furthermore, not only a real-time utterance by the user A but also a recorded voice may be input to the conversation support device 10.

<Another configuration example of the information processing unit 30>
The series of processes described above can be executed by hardware or can be executed by software. When a series of processing is executed by software, a program constituting the software is installed in the computer. Here, the computer includes, for example, a general-purpose computer capable of executing various functions by installing a computer incorporated in dedicated hardware and various programs. The smartphone 50 in the second configuration example described above corresponds to the computer.

FIG. 26 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processing by a program.

In this computer 200, a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203 are connected to each other by a bus 204.

An input / output interface 205 is further connected to the bus 204. An input unit 206, an output unit 207, a storage unit 208, a communication unit 209, and a drive 210 are connected to the input / output interface 205.

The input unit 206 includes a keyboard, a mouse, a microphone, and the like. The output unit 207 includes a display, a speaker, and the like. The storage unit 208 includes a hard disk, a nonvolatile memory, and the like. The communication unit 209 includes a network interface and the like. The drive 210 drives a removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

In the computer 200 configured as described above, for example, the CPU 201 loads the program stored in the storage unit 208 to the RAM 203 via the input / output interface 205 and the bus 204 and executes the program. A series of processing is performed.

The program executed by the computer 200 (CPU 201) can be provided by being recorded in, for example, a removable medium 211 such as a package medium. The program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer 200, the program can be installed in the storage unit 208 via the input / output interface 205 by attaching the removable medium 211 to the drive 210. The program can be received by the communication unit 209 via a wired or wireless transmission medium and installed in the storage unit 208. In addition, the program can be installed in the ROM 202 or the storage unit 208 in advance.

Note that the program executed by the computer 200 may be a program that is processed in time series in the order described in this specification, or a necessary timing such as when a call is made in parallel. It may be a program in which processing is performed.

Note that the embodiments of the present technology are not limited to the above-described embodiments, and various modifications can be made without departing from the gist of the present technology.

The present technology can also have the following configurations.
(1)
A voice acquisition unit that acquires the voice information of the first user input to the voice input device;
A display control unit for controlling display of text information corresponding to the acquired voice information in a display device for a second user,
The display control unit controls the display amount of the text information based on at least one of the display amount of the text information on the display device or the input amount of the voice information input from the voice input device. Processing equipment.
(2)
The information processing apparatus according to (1), wherein the display control unit suppresses a display amount of the text information when a display amount of the text information becomes a predetermined amount or more.
(3)
The information processing apparatus according to (1) or (2), wherein the display control unit suppresses a display amount of the text information by suppressing a display amount of a predetermined part of speech included in the text information.
(4)
The information processing unit according to any one of (1) to (3), wherein the display control unit suppresses a display amount of the text information based on a predetermined operation by the first user or the second user. apparatus.
(5)
The predetermined operation includes a first operation by the first user or the second user,
The information processing apparatus according to (4), wherein the display control unit deletes the display of the text information based on the first operation after suppressing the display amount of the text information.
(6)
The predetermined operation includes a second operation by the first user or the second user,
The information processing apparatus according to (5), wherein after the display of the text information is erased, the display control unit displays the text information erased in the display device again based on the second operation. .
(7)
The information processing apparatus according to any one of (1) to (6), wherein the display control unit controls at least one of a line feed or a page break of the text information display according to the analysis result of the text information.
(8)
When one of the first user or the second user performs an operation related to the text information, information indicating that an operation related to the text information has been performed is used as information indicating the first user or the second user. The information processing apparatus according to any one of (1) to (7), further including a notification unit that notifies the other of the information.
(9)
When one of the first user or the second user performs an operation of suppressing the amount of display of the text information, the notification unit may notify the other of the first user or the second user, The information processing apparatus according to (8), which notifies that the display amount of the text information is suppressed.
(10)
When one of the first user or the second user performs an operation of deleting the display of the text information, the notification unit may notify the other of the first user or the second user, The information processing apparatus according to (8) or (9), wherein the information indicating that the display of the text information has been deleted is notified.
(11)
When the second user performs an operation for requesting re-speech of the text information displayed on the display device, the notification unit performs a notification to prompt the re-speech to the first user. ) To (10).
(12)
When the second user performs an operation for requesting an inquiry about the text information displayed on the display device, the notification unit has received an inquiry about the text information from the first user. The information processing apparatus according to any one of (8) to (11).
(13)
The display control unit suppresses a display amount of the text information on the display device based on a result of the second user's read detection based on at least one of the second user's utterance or action. The information processing apparatus according to any one of 1) to (12).
(14)
The information processing apparatus according to any one of (1) to (13), wherein the display control unit stops displaying the text information on the display device based on at least one of the utterance or action of the first user. .
(15)
A feedback control unit that controls notification of feedback information to at least one of the first user and the second user based on at least one of a display amount of the text information and an input amount of the audio information on the display device; The information processing apparatus according to any one of (1) to (14).
(16)
The information processing apparatus according to (15), wherein the feedback information is information that prompts the first user to change at least one of an utterance speed and an utterance break.
(17)
The information processing apparatus according to (15) or (16), wherein the feedback information is information that prompts the second user to read the text information displayed on the display device.
(18)
A voice recognition unit that converts the voice information of the first user into the text information;
The information processing apparatus according to any one of (1) to (17), wherein the voice recognition unit is provided inside the information processing apparatus or on a server connected via the Internet.
(19)
In the information processing method of the information processing apparatus,
According to the information processing apparatus,
A voice acquisition step of acquiring voice information of the first user input to the voice input device;
A display control step for controlling display of text information corresponding to the acquired voice information in a display device for a second user,
The display control step performs control related to the display amount of the text information based on at least one of the display amount of the text information on the display device or the input amount of the voice information input from the voice input device. Processing method.
(20)
A voice input device for acquiring voice information of the first user;
A display control device for controlling display of text information corresponding to the acquired voice information;
A display device for displaying the text information for a second user in accordance with control from the display control device;
The display control device performs control related to a display amount of the text information based on at least one of a display amount of the text information on the display device or an input amount of the voice information input from the voice input device. Processing system.

10 conversation support devices, 21 sound collection unit, 22 display unit, 23 operation input unit, 30 information processing unit, 31 voice recognition unit, 32 image recognition unit, 33 misrecognition learning unit, 34 misrecognition list, 35 analysis unit, 36 Editing unit, 37 additional learning unit, 38 display waiting list holding unit, 39 display control unit, 40 feedback control unit, 41 imaging unit, 42 sound collection unit, 43 display unit, 44 operation input unit, 50 smartphone, 60 tablet PC , 80 projector, 90 TV, 100 head microphone, 110 camera, 111 erase button, 112 recurrence button, 113 NG word registration button, 114 additional button, 200 computer, 201 CPU

Claims

A voice acquisition unit that acquires the voice information of the first user input to the voice input device;
A display control unit for controlling display of text information corresponding to the acquired voice information in a display device for a second user,
The display control unit controls the display amount of the text information based on at least one of the display amount of the text information on the display device or the input amount of the voice information input from the voice input device. Processing equipment.
The information processing apparatus according to claim 1, wherein the display control unit suppresses a display amount of the text information when a display amount of the text information becomes a predetermined amount or more.
The information processing apparatus according to claim 2, wherein the display control unit suppresses a display amount of the text information by suppressing a display amount of a predetermined part of speech included in the text information.
The information processing apparatus according to claim 2, wherein the display control unit suppresses a display amount of the text information based on a predetermined operation by the first user or the second user.
The predetermined operation includes a first operation by the first user or the second user,
The information processing apparatus according to claim 4, wherein the display control unit deletes the display of the text information based on the first operation after suppressing the display amount of the text information.
The predetermined operation includes a second operation by the first user or the second user,
The information processing apparatus according to claim 5, wherein the display control unit displays the text information erased in the display device again based on the second operation after erasing the display of the text information.
The information processing apparatus according to claim 2, wherein the display control unit controls at least one of a line feed or a page break of the text information display according to the analysis result of the text information.
When one of the first user or the second user performs an operation related to the text information, information indicating that an operation related to the text information has been performed is used as information indicating the first user or the second user. The information processing apparatus according to claim 1, further comprising a notification unit that notifies the other of the information.
When one of the first user or the second user performs an operation of suppressing the amount of display of the text information, the notification unit may notify the other of the first user or the second user, The information processing apparatus according to claim 8, notifying that a display amount of the text information is suppressed.
When one of the first user or the second user performs an operation of deleting the display of the text information, the notification unit may notify the other of the first user or the second user, The information processing apparatus according to claim 8, which notifies that the display of text information has been erased.
The notification unit, when the second user performs an operation for requesting a re-speech of the text information displayed on the display device, performs a notification that prompts the first user to re-speak. The information processing apparatus described in 1.
When the second user performs an operation for requesting an inquiry about the text information displayed on the display device, the notification unit has received an inquiry about the text information from the first user. The information processing apparatus according to claim 8.
The display control unit suppresses a display amount of the text information on the display device based on a result of the second user's read detection based on at least one of the second user's utterance or action. The information processing apparatus according to 1.
The information processing apparatus according to claim 1, wherein the display control unit stops displaying the text information on the display device based on at least one of the first user's utterance or action.
A feedback control unit that controls notification of feedback information to at least one of the first user and the second user based on at least one of a display amount of the text information and an input amount of the audio information on the display device; The information processing apparatus according to claim 1.
The information processing apparatus according to claim 15, wherein the feedback information is information prompting the first user to change at least one of an utterance speed and an utterance break.
The information processing apparatus according to claim 15, wherein the feedback information is information that prompts the second user to read the text information displayed on the display device.
A voice recognition unit that converts the voice information of the first user into the text information;
The information processing apparatus according to claim 1, wherein the voice recognition unit is provided inside the information processing apparatus or on a server connected via the Internet.
In the information processing method of the information processing apparatus,
According to the information processing apparatus,
A voice acquisition step of acquiring voice information of the first user input to the voice input device;
A display control step for controlling display of text information corresponding to the acquired voice information in a display device for a second user,
The display control step performs control related to the display amount of the text information based on at least one of the display amount of the text information on the display device or the input amount of the voice information input from the voice input device. Processing method.
A voice input device for acquiring voice information of the first user;
A display control device for controlling display of text information corresponding to the acquired voice information;
A display device for displaying the text information for a second user in accordance with control from the display control device;
The display control device performs control related to a display amount of the text information based on at least one of a display amount of the text information on the display device or an input amount of the voice information input from the voice input device. Processing system.