Import content from Web pages

You can import the content in Web pages into a knowledge base. You can do this all at once when you create the knowledge base, or you can import some Web pages then and others later.

Important notes

  • You can only import Web pages if you've activated our Generative AI features. If your interest in Large Language Models (LLMs) and Generative AI is piqued, don’t hesitate to get started with LivePerson’s trustworthy solution. We continue to roll out new features and enhancements that support safe, responsible, and equal AI. Join us on the journey.
  • You can import publicly accessible Web pages or those stored on your computer.
  • New articles created from the Web pages are enabled by default, so ensure the content is suitable before you import.

Best practices

The challenge with importing a Web page is that you typically don’t control the page’s formatting and design, so you can’t adjust it for best results before importing. For this reason, review the articles after the import to ensure they’re as you expect and desire.

Limitations

  • Navigation bars, iframes, and style tags are ignored.
  • Headers and footers typically are ignored, but this depends on the location. If, for example, the footer is higher or lower than the import tool expects, it might be included in the import.
  • Tables are imported, but complex, multi-dimensional tables might not be structured well.
  • Images and other asset URLs aren’t imported.
  • Most HTML formatting isn't preserved, so you'll need to manually reapply it after verifying the import was successful. Learn more farther below.

Import Web pages

  1. When you add the knowledge base, for Import content from, select “Web pages.” Then select Enter URL or Upload from computer.

    The Add knowledge base dialog with a callout to the Import content from setting

  2. If you selected Upload from computer, upload up to 10 files. Alternatively, if you selected Enter URL, add up to 10 URLs that are publicly accessible. In either case, specify Web pages that are as focused as possible. The closer they relate to the scope of the knowledge base, the better. Examples include FAQ pages, glossaries, blogs, and other articles.

    When entering a URL:

    • Enter the full URL. Defining URL patterns to match a range of URLs isn’t supported. That is, don’t use an asterisk to try to match any characters after that position in the URL.
    • Remember that the URL must be publicly accessible, as stated above.
    • Select a URL depth. This determines how far down the URL hierarchy to crawl for valid content. URL Depth is described in detail farther below.

    You can import more than 10 URLs, but only up to 10 at a time. Use the Sources page to import subsequent batches.

  3. Click Save.

    This starts the asynchronous import process, which takes some time depending on the URL depth and content.
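
The URL rules in step 2 can be checked before you open the dialog. The following is a minimal Python sketch, not part of the product: the function names check_import_url and batches_of_ten are hypothetical, and the checks cover only what is stated above (a full URL, no asterisk patterns, at most 10 URLs per import). It cannot tell whether a URL is publicly accessible.

```python
from urllib.parse import urlparse

MAX_URLS_PER_IMPORT = 10  # limit stated in step 2


def check_import_url(url: str) -> list[str]:
    """Return the problems found in one URL (an empty list means none were found)."""
    problems = []
    if "*" in url:
        problems.append("URL patterns with an asterisk aren't supported")
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("enter the full URL, including http:// or https://")
    return problems


def batches_of_ten(urls: list[str]) -> list[list[str]]:
    """Split a list of URLs into batches of at most 10, one batch per import."""
    return [
        urls[i:i + MAX_URLS_PER_IMPORT]
        for i in range(0, len(urls), MAX_URLS_PER_IMPORT)
    ]
```

For example, a list of 23 URLs yields three batches: the first is entered when you add the knowledge base, and the other two are imported later from the Sources page.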

More about URL depth

URL Depth is best explained by way of an example. Assume the following:

  • URL = http://www.acmetelco.com/faqs
  • URL depth = 0
  • Result =
    • Imports the contents of the FAQs page only, i.e., http://www.acmetelco.com/faqs.
    • The contents of pages accessed via links on the FAQs page are ignored.

An architectural diagram describing the results when the URL depth is zero

  • URL depth = 1
  • Result =
    • Imports the contents of the FAQs page: http://www.acmetelco.com/faqs.
    • Imports the contents of any URL linked directly from the FAQs page if that link is considered valid, i.e., if it starts with http://www.acmetelco.com/faqs.

    If the FAQs page contains a link to http://www.acmetelco.com/faqs/packages, it’s imported too. But neither http://www.acmetelco.com/support nor http://www.acmetelco.com/packages/premier is imported, because neither starts with http://www.acmetelco.com/faqs.

An architectural diagram describing the results when the URL depth is one

  • URL depth = 2
  • Result =
    • Imports the contents of the FAQs page.
    • Imports the contents of valid URLs found when depth = 1.
    • Imports the contents of valid URLs found in the pages found when depth = 1.

An architectural diagram describing the results when the URL depth is two

Overall, keep in mind that the URL depth isn’t the number of URL “parts” after the host. It’s the number of clicks that it takes to reach a page from the submitted page, assuming that the page’s URL starts with the relevant prefix.

Be aware that it might not be necessary to crawl as deep as the specified URL depth to import all of the content; this depends entirely on the contents (the presence or absence of links) on the pages. The higher the URL depth, the longer it takes to perform the import.
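
The depth rules above can be modeled as a breadth-first walk over links. The following Python sketch illustrates the described behavior only; it is not the import tool’s actual crawler. The function name pages_to_import and the SITE link graph are hypothetical, and the graph stands in for the links that a real crawl would discover.

```python
from collections import deque


def pages_to_import(start_url: str, max_depth: int, links: dict[str, list[str]]) -> list[str]:
    """Return the pages reached within max_depth clicks of start_url.

    A link is followed only if it starts with start_url (the "valid" prefix).
    """
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't follow links beyond the requested depth
        for target in links.get(url, []):
            if target.startswith(start_url) and target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical link graph for the example site; not real crawl data.
SITE = {
    "http://www.acmetelco.com/faqs": [
        "http://www.acmetelco.com/faqs/packages",
        "http://www.acmetelco.com/support",
        "http://www.acmetelco.com/packages/premier",
    ],
    "http://www.acmetelco.com/faqs/packages": [
        "http://www.acmetelco.com/faqs/packages/premier/billing",
    ],
}
```

With this graph, a depth of 0 returns only the FAQs page, a depth of 1 adds /faqs/packages, and a depth of 2 adds /faqs/packages/premier/billing. That last page has four URL “parts” after the host but is only two clicks away, which is why depth is counted in clicks. Any depth above 2 returns the same three pages, because there are no further valid links to follow.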

Monitor import progress

The Sources page shows you detailed info about the progress and results:

The Sources page showing the status indicators for a completed import

The articles created are enabled by default; however, it’s a good idea to review them to ensure the results are as you expect.

Verify the import

After the import is completed, use the Sources page to check the results and determine if there were any errors or any failed imports. There are 3 views of your sources:

  • All: Lists all sources, including canceled ones.
  • Success: Lists sources where imports were fully or partially successful.
  • Failed: Lists sources where imports failed; no articles were created.

The tabs for the three views of an imported source

For successful imports, verify the article title/answer pairs are as you expect and desire, and that there’s no missing content.

Post-import tasks

During the import process, <a> tags (anchor tags) are preserved. However, other HTML formatting isn't preserved, so you'll need to manually reapply it after verifying the import was successful. Keep in mind that KnowledgeAI™ supports only a subset of HTML.

Troubleshooting Web page imports

If you encounter errors, use the following info as a guide to help you resolve them:

  • Process timed out
    • Description: The page can’t be loaded.
    • How to fix: If it’s a valid URL, contact LivePerson Support.
  • Page not found
    • Description: The page can’t be found.
    • How to fix: Try the URL. If it’s valid, contact LivePerson Support.
  • No content to import
    • Description: Nothing on the page is relevant content-wise.
    • How to fix: Try another URL. Or, contact LivePerson Support if the URL is valid.
  • Content couldn’t be fetched
    • Description: The contents of a URL could not be obtained, where the URL is either the one you specified or one found during crawling the URL hierarchy based on the URL depth. If the problematic URL is public and accessible to you, then typically this issue is due to the import tool being blocked by your site servers.
    • How to fix: Sometimes the issue goes away given some time (minutes to hours depending on your policies). Otherwise, contact LivePerson Support, so we can facilitate whitelisting of our service by your site.
  • Too many tasks to handle
    • Description: The URLs specified and found during import led to too many concurrent tasks, and the import process timed out. This error is thrown instead of overloading your site with too many requests. Instead, every task waits its turn. But when there are too many tasks, a task might time out while waiting.
    • How to fix: Reduce the load: Decrease the number of specified URLs and/or their URL depths.
  • Processing error
    • Description: Something went wrong behind the scenes. This is usually an issue on our end.
    • How to fix: Contact LivePerson Support.
  • Too many URLs
    • Description: There are too many URLs to import.
    • How to fix: Reduce the URL depth and try again.

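If the Web page import just won't work, add the content using a more manual method, such as a CSV import or direct add. Saving the HTML page as a PDF and importing the PDF is a last resort only: conversions like this aren’t supported and tend to result in unpredictable formatting, as noted in the PDF limitations farther below.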

Import content from PDFs

You can import PDFs into a knowledge base. You can do this all at once when you create the knowledge base, or you can import some PDFs then and others later.

Important notes

  • You can only import PDFs if you've activated our Generative AI features. If your interest in Large Language Models (LLMs) and Generative AI is piqued, don’t hesitate to get started with LivePerson’s trustworthy solution. We continue to roll out new features and enhancements that support safe, responsible, and equal AI. Join us on the journey.
  • You can import PDFs that are stored on your computer, or PDFs stored in a publicly accessible location (Google Drive, Web site, etc.).
  • If you import PDFs that are in a publicly accessible location, make sure the contents are retrievable via the command-line wget (“web get”) utility and then viewable in a PDF viewer. After the import is completed, you can restrict access to the file if desired.
  • New articles created from the PDFs are enabled by default, so ensure the content is suitable before you import.
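
The retrievability check described above can be approximated in a script. The following Python sketch is an assumption-laden stand-in for running wget by hand, not the import tool’s own check: it downloads the first bytes of a URL without credentials and tests for the standard PDF signature (“%PDF-”). The function names and the example URL are hypothetical.

```python
import urllib.request


def looks_like_pdf(data: bytes) -> bool:
    """True if the bytes begin with the standard PDF file signature."""
    return data.startswith(b"%PDF-")


def fetch_start_of_file(url: str, timeout: float = 10.0) -> bytes:
    """Download the first few bytes of a URL without sending any credentials."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read(8)


# Example usage (hypothetical URL; requires network access):
# print(looks_like_pdf(fetch_start_of_file("https://www.example.com/docs/faq.pdf")))
```

A sharing link that returns an HTML viewer page instead of the file itself fails this check, which is a hint that the link probably isn’t a direct link to the PDF.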

Best practices

The great news about importing a PDF is that you likely have control over the formatting and can modify the document for best results.

  • Keep the document’s design as simple as possible.
  • Format it so that each article’s title is in a larger font size than the article’s answer. Or, use different font styles for each. If the title and the answer use the same font size and style, the two pieces of content can’t be differentiated by the import tool.
  • Separate the title from the answer by a new line.

Here’s an example of a well-formatted PDF:

A PDF that adheres to the documented best practices

Also, review the articles after the import to ensure they’re as you expect and desire.

Limitations

  • PDFs larger than 150 MB can’t be imported.
  • Encrypted or password-protected PDFs can’t be imported.
  • Not supported:
    • Multi-column PDFs
    • Complex PDFs with an unpredictable organizational design
    • Import of tables, images, and hyperlinks
  • Most HTML formatting isn't preserved, so you'll need to manually reapply it after verifying the import was successful. Learn more farther below.
  • Don’t convert a Web page to a PDF and then import the PDF. Import the Web page itself. Conversions like this aren’t supported because they result in unpredictable formatting that isn’t ingested well.
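
Two of the limitations above (file size and encryption) can be screened for before you upload. The following Python sketch is a rough pre-check under stated assumptions, not the import tool’s validation: it assumes that 1 MB means 1024 × 1024 bytes, and it detects encryption only heuristically by looking for the “/Encrypt” key that encrypted PDFs normally contain. The function name preflight_pdf is hypothetical.

```python
MAX_PDF_BYTES = 150 * 1024 * 1024  # assumption: 1 MB is counted as 1024 * 1024 bytes


def preflight_pdf(data: bytes, max_bytes: int = MAX_PDF_BYTES) -> list[str]:
    """Return warnings for a PDF's raw bytes; an empty list means nothing was spotted."""
    warnings = []
    if len(data) > max_bytes:
        warnings.append("larger than the size limit")
    # Heuristic only: encrypted PDFs normally reference an /Encrypt dictionary.
    if b"/Encrypt" in data:
        warnings.append("possibly encrypted or password-protected")
    return warnings


# Example usage (hypothetical file name):
# from pathlib import Path
# print(preflight_pdf(Path("faq.pdf").read_bytes()))
```

An empty result doesn’t guarantee a successful import; multi-column layouts and complex designs, for example, aren’t detected by this sketch.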

Import PDFs

  1. When you add the knowledge base, for Import content from, select “PDF.” Then enter the URLs for up to 10 PDFs, or upload them from your computer.

    Add knowledge base dialog with a callout to the Import content from setting

    You can import more than 10 PDFs, but only up to 10 at a time. Use the Sources page to import subsequent batches.

  2. Click Save.

    This starts the asynchronous import process, which takes some time depending on the size of the PDFs.

Monitor import progress

The Sources page shows you detailed info about the progress and results:

The Sources page showing the status indicators for a completed import

The articles created are enabled by default; however, it’s a good idea to review them to ensure the results are as you expect.

Verify the import

After the import is completed, use the Sources page to check the results and determine if there were any errors or any failed imports. There are 3 views of your sources:

  • All: Lists all sources, including canceled ones.
  • Success: Lists sources where imports were fully or partially successful.
  • Failed: Lists sources where imports failed; no articles were created.

The tabs for the three views of an imported source

For successful imports, verify the article title/answer pairs are as you expect and desire, and that there’s no missing content.

Post-import tasks

During the import process, <a> tags (anchor tags) are preserved. However, other HTML formatting isn't preserved, so you'll need to manually reapply it after verifying the import was successful. Keep in mind that KnowledgeAI supports only a subset of HTML.

Troubleshooting PDF imports

If you encounter errors, use the following info as a guide to help you resolve them:

  • File size too large
    • Description: The PDF is larger than 150 MB.
    • How to fix: Divide the PDF into smaller ones, and import those.
  • Incorrect file type
    • Description: The file is not of type PDF.
    • How to fix: Create a file of type PDF and use that.
  • Processing error
    • Description: Something went wrong behind the scenes. This is usually an issue on our end.
    • How to fix: Contact LivePerson Support to report the issue.

Import content from a Google sheet

To import content from a Google sheet, the sheet must be public, i.e., with no file restrictions in place. For details on creating the file, see “Create an import file (CSV or Google sheet)” farther below.

When creating one knowledge base based on another, don't reuse the same Google sheet for a second knowledge base in the same hosted region. The article IDs must be unique within the region. In the file for the second knowledge base, clear the article IDs; the application will create article IDs for new articles.

Import content from a CSV file

For details on creating a CSV import file, see “Create an import file (CSV or Google sheet)” farther below.

If the system finds errors in the CSV file you are importing, you’re notified of this and offered a Download errors button. Use the button to download info on the errors that were found in the import file. Make sure your browser is configured to allow pop-ups, or you won’t be able to complete the download.

When creating one knowledge base based on another, don't reuse the same CSV file for a second knowledge base in the same hosted region. The article IDs must be unique within the region. In the file for the second knowledge base, clear the article IDs; the application will create article IDs for new articles.
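
Clearing the article IDs in a copy of a CSV file can be scripted. The following Python sketch assumes the file has a header row with a column named id, as described in the column headers farther below; the function name clear_article_ids is hypothetical. It leaves every other column untouched.

```python
import csv
import io


def clear_article_ids(csv_text: str) -> str:
    """Return a copy of the CSV text with every value in the "id" column blanked."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if "id" in row:
            row["id"] = ""  # the application assigns new IDs on import
        writer.writerow(row)
    return out.getvalue()
```

The column stays in place with empty values, so the file for the second knowledge base keeps the same column order as the original.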

Create an import file (CSV or Google sheet)

If you want to import a set of articles into a knowledge base when you add the knowledge base, you'll need to create the import file.

An example of a well-formed import file

The import file can contain a subset of HTML, and it should adhere to these limits. Additionally, as a best practice, ensure the file is saved as a UTF-8 encoded CSV file before you import it. This is particularly important if you need to support special language characters (e.g., ö, ü, ß).

To create an import file

  1. Create a new CSV file or Google sheet. A Google sheet must be public, i.e., with no file restrictions in place.
  2. Add the column headers listed below; use the order listed in the table below.
  3. Fill out the rows with your article data. It's recommended that you complete at least these columns: title, summary, detail, tags, and alternates (if using Knowledge Base intents) or intentName (if using Domain intents).
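
If you generate the file from a script instead of a spreadsheet, the steps above might look like the following Python sketch. It is an illustration under assumptions, not an official tool: the function name build_import_csv is hypothetical, and the column list simply mirrors the order of the column headers described below. The output is encoded as UTF-8 so that special language characters survive.

```python
import csv
import io

# Column headers in the order given in the "Column headers" section.
COLUMNS = [
    "id", "tags", "title", "summary", "alternates", "detail",
    "content_url", "image_url", "audio_url", "video_url",
    "category", "intentName", "validFrom", "validTo",
]


def build_import_csv(articles: list[dict[str, str]]) -> bytes:
    """Serialize articles to UTF-8 CSV bytes; columns not supplied are left empty."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for article in articles:
        writer.writerow(article)  # raises ValueError on an unknown column name
    return buffer.getvalue().encode("utf-8")


# Example usage (hypothetical file name and article):
# from pathlib import Path
# rows = [{"title": "Wie ändere ich mein Passwort?", "summary": "Öffnen Sie die Einstellungen."}]
# Path("articles.csv").write_bytes(build_import_csv(rows))
```

Because the csv module quotes any value that contains a comma, a comma-separated tags value stays in a single column.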

Column headers

  • id: A String; a unique ID assigned to an article.
    • This column isn't required when you initially create the knowledge base. However, if you're using a Google sheet that you plan to sync periodically, it does play a role then. Before performing a sync, update the Google sheet to include the “id” column and enter the IDs for all existing articles.
    • When creating one knowledge base based on another, don't reuse the same CSV import file or Google sheet for a second knowledge base in the same hosted region. The article IDs must be unique within the region. In the file for the second knowledge base, clear the article IDs; the application will create article IDs for new articles.
  • tags: A comma-separated list of relevant keywords. These highlight the key noun(s) or word(s) in the training phrases. For example, for an article about health insurance, the tags should be “health”, “insurance”, “benefits”. These should be words, not sentences.
  • title: The article title. This should be a complete sentence or question that the user might ask. Adhere to best practices.
  • summary: A short response or message to be sent to the user. You can include web links, although depending on the channel they might not display correctly. For SMS/Messaging, you might need to show the URL by itself, not wrapped in HTML, since the HTML will be sent as plain text over these channels.
  • alternates: Applicable if you're using Knowledge Base intents, not Domain intents. In the UI, these are called “intent qualifiers.” Intent qualifiers are alternative ways that people ask for the article, i.e., alternative ways to communicate the same intent. Adhere to best practices.
  • detail: A longer message to the user. For messaging, it's recommended that you keep the responses as brief as possible.
  • content_url: The URL of a hyperlink. For info on usage, see this section.
  • image_url: The URL of an image. For info on usage, see this section.
  • audio_url: The URL of an audio file. For info on usage, see this section.
  • video_url: The URL of a video file. For info on usage, see this section.
  • category: Assigning a category lets you filter and find articles based on categories in the KnowledgeAI application.
  • intentName: Applicable if you're using Domain intents, not Knowledge Base intents. This is the intent associated with the article.
  • validFrom: The date and time at which the article becomes active, specified as Epoch time in milliseconds.
  • validTo: The date and time at which the article becomes inactive, specified as Epoch time in milliseconds.
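
The validFrom and validTo values are plain integers, which are easy to get wrong by a factor of 1,000 (seconds instead of milliseconds). The following Python sketch shows one way to produce them; the function name to_epoch_millis is hypothetical, and the sketch insists on timezone-aware datetimes so the result doesn’t depend on the local machine’s time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to Epoch time in milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis(datetime(2025, 1, 1, tzinfo=timezone.utc)))  # 1735689600000
```

An article meant to become active at midnight UTC on January 1, 2025 would therefore carry 1735689600000 in its validFrom column.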

Update content

When updating content in knowledge bases, be mindful that you can’t mix content types within a knowledge base.

How updates are made for each content type:

  • Web pages
    • In the knowledge base, use the Sources page to import additional Web pages, or to re-import a Web page.
    • If the import process finds a match by article title, the contents of that article are overwritten. As you might expect, new articles are created when necessary. But you must manually delete articles that you no longer need.
  • PDFs
    • In the knowledge base, use the Sources page to import additional PDFs, or to import a modified version of the same PDF.
    • If the import process finds a match by article title, the contents of that article are overwritten. As you might expect, new articles are created when necessary. But you must manually delete articles that you no longer need.
  • Google sheet
    • In the knowledge base, use the Sources page to sync the knowledge base with the contents of the Google sheet, or to replace the sheet with another.
    • Syncing overwrites the content in the knowledge base with that in the sheet, so adds, updates, and deletes do occur. New articles are enabled by default, so ensure the content is suitable before you sync.
    • Replacing the sheet means replacing all the content with that in a different sheet. Articles will be added, updated, and deleted, all based on the contents in the new sheet. Here again, new articles are enabled by default, so ensure the content is suitable before you replace the sheet.
  • CSV file
    • Perform updates manually within the knowledge base. Or, use the Sources page to replace the CSV file.
    • Replacing the file means replacing all the content in the knowledge base with that in a different file. Articles will be added, updated, and deleted, all based on the contents in the new file. New articles are enabled by default, so ensure the content is suitable before you replace the CSV file.
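
The difference between the two update behaviors described above (re-import versus sync or replace) can be made concrete with a small model. The following Python sketch is illustrative only and isn’t how the application is implemented: articles are reduced to a dictionary that maps a title to its content, both function names are hypothetical, and keying the sync comparison by title instead of by article ID is a simplifying assumption.

```python
def reimport(existing: dict[str, str], imported: dict[str, str]) -> dict[str, str]:
    """Web page or PDF re-import: overwrite on a title match, add new titles, delete nothing."""
    merged = dict(existing)
    merged.update(imported)
    return merged


def sync_changes(existing: dict[str, str], source: dict[str, str]) -> dict[str, list[str]]:
    """Sync or replace: the knowledge base ends up mirroring the source."""
    return {
        "added": sorted(source.keys() - existing.keys()),
        "updated": sorted(
            title for title in source.keys() & existing.keys()
            if source[title] != existing[title]
        ),
        "deleted": sorted(existing.keys() - source.keys()),
    }
```

In this model, an article whose title no longer appears in the source survives a Web page or PDF re-import and must be deleted manually, but it shows up under “deleted” after a Google sheet sync or a file replacement.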

Delete content

You can delete a knowledge base's source at any time, but be aware that this permanently deletes all of the articles that were created using the source.

The Sources page within a knowledge base, with a callout to the Delete action link for a source

Integrating with an external CMS/KMS? Learn more about deleting content.