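Plan

Look into the detailed specs of Amazon Kendra. I want to use Kendra for game-related queries, so I need to understand its search behavior and indexing conditions. While at it, look into RAG with Bedrock if that turns out to be necessary.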

Kendra Crawler / Data Source Connector

https://docs.aws.amazon.com/ja_jp/kendra/latest/dg/data-sources.html

Amazon Kendra Web Crawler - Amazon Kendra https://docs.aws.amazon.com/ja_jp/kendra/latest/dg/data-source-web-crawler.html

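Only public websites can be crawled.
When selecting websites to index, you must comply with the Amazon Acceptable Use Policy and all other Amazon terms.
Crawling can be blocked via robots.txt, but expecting the crawled site to set that up in advance seems like a burden? This applies to crawlers in general.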
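
To check how a given robots.txt would treat the crawler before pointing Kendra at a site, a minimal sketch using only the Python standard library might look like this. The user agent string `amazon-kendra` and the sample rules are assumptions for illustration; confirm the actual user agent in the Kendra documentation for the connector version in use.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the crawler identifies itself as "amazon-kendra".
KENDRA_USER_AGENT = "amazon-kendra"

# Hypothetical robots.txt content, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: amazon-kendra
Disallow: /private/

User-agent: *
Allow: /
"""


def can_kendra_crawl(robots_txt: str, url: str,
                     user_agent: str = KENDRA_USER_AGENT) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/index.html",
                "https://example.com/private/page.html"):
        print(url, can_kendra_crawl(SAMPLE_ROBOTS, url))
```

This only evaluates the rules locally. It says nothing about whether crawling the site is acceptable under the policies quoted in these notes.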

Abusing Amazon Kendra Web Crawler to aggressively crawl websites or web pages you don’t own is not considered acceptable use.

So: avoid crawling pages you don't own.

Amazon Kendra Web Crawler connector v2.0 / TemplateConfiguration API

Field mappings
Inclusion/exclusion filters
Full and incremental content sync
Web proxy
Basic, NTLM/Kerberos, SAML, and form authentication for websites
Virtual private cloud (VPC)

Crawl specification

https://docs.aws.amazon.com/ja_jp/kendra/latest/dg/data-source-v2-web-crawler.html

Uses the Selenium web crawler package and a Chromium driver

API_TemplateConfiguration
- Template for connecting to a data source https://docs.aws.amazon.com/ja_jp/kendra/latest/dg/ds-schemas.html#ds-schema-web-crawler

{
    "Id": "xx",
    "IndexId": "xx",
    "Name": "hogehoge",
    "Type": "WEBCRAWLER",
    "Configuration": {
        "WebCrawlerConfiguration": {
            "Urls": {
                "SeedUrlConfiguration": {
                    "SeedUrls": [
                        "https://aws.amazon.com/jp/hogehoge1/",
                        "https://aws.amazon.com/jp/hogehoge2/"
                    ],
                    "WebCrawlerMode": "HOST_ONLY"
                }
            },
            "CrawlDepth": 1,
            "UrlInclusionPatterns": [
                "https://aws.amazon.com/jp/.*"
            ]
        }
    },
    "CreatedAt": "xx",
    "UpdatedAt": "xx",
    "Description": "",
    "Status": "ACTIVE",
    "Schedule": "",
    "RoleArn": "hoge",
    "LanguageCode": "ja"
}
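
The description above can be turned back into a creation request. The following is a minimal sketch, assuming the same `WebCrawlerConfiguration` shape as in the JSON; the helper name `build_web_crawler_request` and all argument values are hypothetical. The resulting dict is meant to be passed to the boto3 Kendra client's `create_data_source`, which is left as a comment so the sketch runs without AWS credentials.

```python
from typing import Iterable, Optional


def build_web_crawler_request(index_id: str, name: str, role_arn: str,
                              seed_urls: Iterable[str],
                              crawl_depth: int = 1,
                              mode: str = "HOST_ONLY",
                              inclusion_patterns: Optional[Iterable[str]] = None,
                              language_code: str = "ja") -> dict:
    """Build keyword arguments for a WEBCRAWLER data source (hypothetical helper)."""
    crawler = {
        "Urls": {
            "SeedUrlConfiguration": {
                "SeedUrls": list(seed_urls),
                "WebCrawlerMode": mode,
            }
        },
        "CrawlDepth": crawl_depth,
    }
    if inclusion_patterns:
        crawler["UrlInclusionPatterns"] = list(inclusion_patterns)
    return {
        "IndexId": index_id,
        "Name": name,
        "Type": "WEBCRAWLER",
        "RoleArn": role_arn,
        "LanguageCode": language_code,
        "Configuration": {"WebCrawlerConfiguration": crawler},
    }


request = build_web_crawler_request(
    index_id="xx",
    name="hogehoge",
    role_arn="hoge",
    seed_urls=["https://aws.amazon.com/jp/hogehoge1/",
               "https://aws.amazon.com/jp/hogehoge2/"],
    inclusion_patterns=["https://aws.amazon.com/jp/.*"],
)

# With boto3 installed and credentials configured, this would be sent as:
#   import boto3
#   boto3.client("kendra").create_data_source(**request)
```

Read-only fields from the description (`Id`, `CreatedAt`, `UpdatedAt`, `Status`) are intentionally left out, since they are returned by the service rather than supplied by the caller.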

https://docs.aws.amazon.com/ja_jp/kendra/latest/dg/ds-schemas.html#web-crawler-json

https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/aws-properties-kendra-datasource-webcrawlerconfiguration.html

The CloudFormation reference above defines the parameters that can be used in this block:

    "Configuration": {
        "WebCrawlerConfiguration": {
            "Urls": {
                "SeedUrlConfiguration": {
                    "SeedUrls": [
                        "https://aws.amazon.com/jp/hogehoge1/",
                        "https://aws.amazon.com/jp/hogehoge2/"
                    ],
                    "WebCrawlerMode": "HOST_ONLY"
                }
            },
            "CrawlDepth": 1,
            "UrlInclusionPatterns": [
                "https://aws.amazon.com/jp/.*"
            ]
        }
    },

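WebCrawlerMode: choose host only (HOST_ONLY), host plus subdomains (SUBDOMAINS), or everything (EVERYTHING)
CrawlDepth: how many levels of in-page links to follow, counted from the seed URL
MaxContentSizePerPageInMegaBytes: maximum size of a page to crawl, in MB
MaxLinksPerPage: maximum number of links on a single page to follow (a per-page limit, not the total number of pages crawled)
UrlInclusionPatterns: regex patterns for the URLs to crawl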
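
To reason about which links a given combination of `WebCrawlerMode` and `UrlInclusionPatterns` would keep, a rough local model can help. This sketch is my own approximation, not Kendra's implementation: the function name `in_scope` is hypothetical, and treating subdomains as "hostname ends with `.` plus the seed host" and matching patterns with a full-string regex match are both assumptions.

```python
import re
from typing import Sequence
from urllib.parse import urlparse


def in_scope(url: str, seed_url: str, mode: str,
             inclusion_patterns: Sequence[str] = ()) -> bool:
    """Approximate whether url would be crawled for the given seed and mode."""
    host = urlparse(url).hostname or ""
    seed_host = urlparse(seed_url).hostname or ""

    if mode == "HOST_ONLY":
        host_ok = host == seed_host
    elif mode == "SUBDOMAINS":
        host_ok = host == seed_host or host.endswith("." + seed_host)
    elif mode == "EVERYTHING":
        host_ok = True
    else:
        raise ValueError(f"unknown WebCrawlerMode: {mode}")

    # Assumption: with no inclusion patterns, every URL passes this check.
    pattern_ok = (not inclusion_patterns or
                  any(re.fullmatch(p, url) for p in inclusion_patterns))
    return host_ok and pattern_ok


seed = "https://aws.amazon.com/jp/hogehoge1/"
patterns = ["https://aws.amazon.com/jp/.*"]
print(in_scope("https://aws.amazon.com/jp/bedrock/", seed, "HOST_ONLY", patterns))
print(in_scope("https://docs.aws.amazon.com/kendra/", seed, "HOST_ONLY"))
print(in_scope("https://docs.aws.amazon.com/kendra/", seed, "SUBDOMAINS"))
```

`CrawlDepth` and `MaxLinksPerPage` are not modeled here; they limit how far and how wide the crawl goes rather than which hosts qualify.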

https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/aws-properties-kendra-datasource-webcrawlersitemapsconfiguration.html

sitemap: a file that lists the URLs within a site and helps make crawling more efficient
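
To see what a sitemap actually exposes to a crawler, here is a minimal sketch that lists the URLs in a sitemap using the Python standard library. The sample XML and the function name `list_sitemap_urls` are made up for illustration. It assumes a plain `urlset` document in the standard sitemap namespace and does not handle sitemap index files.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/</loc></url>
</urlset>"""


def list_sitemap_urls(xml_text: str) -> List[str]:
    """Return the <loc> values of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


print(list_sitemap_urls(SAMPLE_SITEMAP))
```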

CustomDocumentEnrichmentConfiguration https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/aws-properties-kendra-datasource-customdocumentenrichmentconfiguration.html

Enriching your documents during ingestion - Amazon Kendra https://docs.aws.amazon.com/ja_jp/kendra/latest/dg/custom-document-enrichment.html
Lets you modify the content and the document metadata fields or attributes during ingestion.
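
As a starting point for experimenting with CDE, the sketch below builds one inline rule in the shape I understand `CustomDocumentEnrichmentConfiguration` to take (`InlineConfigurations` with a `Condition` and a `Target`). The helper name `build_inline_rule`, the choice of `_source_uri` and `_category`, and the values are assumptions for illustration; check the field names against the API reference before relying on them.

```python
def build_inline_rule(condition_key: str, operator: str, condition_value: str,
                      target_key: str, target_value: str) -> dict:
    """Build one inline enrichment rule with string values (hypothetical helper)."""
    return {
        "Condition": {
            "ConditionDocumentAttributeKey": condition_key,
            "Operator": operator,
            "ConditionOnValue": {"StringValue": condition_value},
        },
        "Target": {
            "TargetDocumentAttributeKey": target_key,
            "TargetDocumentAttributeValueDeletion": False,
            "TargetDocumentAttributeValue": {"StringValue": target_value},
        },
        # Keep the document body; only the metadata attribute is changed.
        "DocumentContentDeletion": False,
    }


# Example: tag every document whose source URI contains "/bedrock/".
cde_config = {
    "InlineConfigurations": [
        build_inline_rule("_source_uri", "Contains", "/bedrock/",
                          "_category", "bedrock"),
    ]
}
```

Inline rules only cover simple attribute rewrites. Anything that needs to inspect the page content would go through the Lambda-based hooks described in the linked documentation.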

Kendra Query API response

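Response from an arbitrary sample query

What is FacetResults?
What is Highlights?
What determines the content of Text?
- It looks like the part enclosed by specific tags is what gets extracted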

{
    "$metadata": {
        "httpStatusCode": 200,
        "requestId": "hoge",
        "attempts": 1,
        "totalRetryDelay": 0
    },
    "FacetResults": [],
    "QueryId": "fuga",
    "ResultItems": [{
        "AdditionalAttributes": [],
        "DocumentAttributes": [{
            "Key": "_source_uri",
            "Value": {
                "StringValue": "https://aws.amazon.com/jp/bedrock/testimonials/"
            }
        }],
        "DocumentExcerpt": {
            "Highlights": [{
                "BeginOffset": 65,
                "EndOffset": 70,
                "TopAnswer": false,
                "Type": "STANDARD"
            }, {
                "BeginOffset": 136,
                "EndOffset": 141,
                "TopAnswer": false,
                "Type": "STANDARD"
            }],
            "Text": "わずか 3 か月で、最初の話し合いから、全世界の従業員の約 10% にあたる 1,000 人以上のアクティブユーザーがいる社内の GenAI チャットボットが機能するようになりました。チャットボットによるアイディアのクラウドソーシングは、すでに 100 を超える潜在的な GenAI ユースケースを特定し、さらに調査するのに役立っています。これらには、研究開発プロセスの強化と加速、営業チームの会議準備のサポート、お客様のメールへの迅速な対応の自動化が含まれます。"
        },
        "DocumentId": "https://aws.amazon.com/jp/bedrock/testimonials/",
        "DocumentTitle": {
            "Highlights": [],
            "Text": "基盤モデルによる生成 AI アプリケーションの構築 - Amazon Bedrock お客様の声 - AWS"
        },
        "DocumentURI": "https://aws.amazon.com/jp/bedrock/testimonials/",
        "FeedbackToken": "xxxx",
        "Format": "TEXT",
        "Id": "yy",
        "ScoreAttributes": {
            "ScoreConfidence": "LOW"
        },
        "Type": "DOCUMENT"
    }],
    "TotalNumberOfResults": 1
}
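
On the Highlights question: in the response above, treating `BeginOffset`/`EndOffset` as a Python slice over `Text` (start inclusive, end exclusive), the first highlight (65 to 70) appears to select "GenAI". A minimal sketch for pulling highlighted terms out of a response dict follows. The function names and the sample response are made up for illustration, and the slice interpretation is an assumption based on that one observation.

```python
from typing import List


def highlighted_terms(result_item: dict) -> List[str]:
    """Slice each highlight range out of the excerpt text."""
    excerpt = result_item.get("DocumentExcerpt", {})
    text = excerpt.get("Text", "")
    return [text[h["BeginOffset"]:h["EndOffset"]]
            for h in excerpt.get("Highlights", [])]


def summarize(response: dict) -> List[dict]:
    """Reduce a Query response to the fields these notes care about."""
    return [
        {
            "uri": item.get("DocumentURI"),
            "title": item.get("DocumentTitle", {}).get("Text"),
            "confidence": item.get("ScoreAttributes", {}).get("ScoreConfidence"),
            "highlights": highlighted_terms(item),
        }
        for item in response.get("ResultItems", [])
    ]


# Hypothetical response with the same shape as the one above.
sample_response = {
    "ResultItems": [{
        "DocumentExcerpt": {
            "Highlights": [{"BeginOffset": 13, "EndOffset": 18,
                            "TopAnswer": False, "Type": "STANDARD"}],
            "Text": "Our internal GenAI chatbot is live.",
        },
        "DocumentTitle": {"Highlights": [], "Text": "Sample page"},
        "DocumentURI": "https://example.com/sample",
        "ScoreAttributes": {"ScoreConfidence": "LOW"},
        "Type": "DOCUMENT",
    }],
    "TotalNumberOfResults": 1,
}

print(summarize(sample_response))
```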

Skimming docs loosely

https://qiita.com/Naoki_Ishihara/items/7bfefc5a4750aa50c58e#%E3%82%A6%E3%82%A7%E3%83%96%E3%83%9A%E3%83%BC%E3%82%B8%E3%81%AB%E3%83%AA%E3%83%B3%E3%82%AF%E3%81%97%E3%81%A6%E3%81%84%E3%82%8B%E6%B7%BB%E4%BB%98%E3%83%95%E3%82%A1%E3%82%A4%E3%83%ABinclude-files-that-web-pages-link-to

Field mappings

https://qiita.com/NKwest/items/60440d294473af3e9503

- Crawler v2 keeps only the contents of the body tag as index information, so values from meta tags first have to be captured via Field Mapping to make them available inside CDE.

Amazon Kendra investigation
