Class: HTMLReader
Extracts the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the tag's inner text. All other tags are removed, and their inner text is kept intact. HTML entities (e.g., &amp;) are not decoded.
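The rules in the description can be illustrated with a small, self-contained sketch. This is not the string-strip-html implementation and not the reader's actual code: the function name `extractSignificantText`, the regular expressions, and the "inner text followed by URL" output format for links are all assumptions made for illustration.

```typescript
// Illustrative stand-in only: NOT the string-strip-html implementation.
const REMOVED_WITH_CONTENTS = ["head", "script", "style", "xml"];

function extractSignificantText(html: string): string {
  let text = html;

  // 1. Drop head/script/style/xml elements together with everything inside them.
  for (const tag of REMOVED_WITH_CONTENTS) {
    const element = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi");
    text = text.replace(element, " ");
  }

  // 2. Keep the URL of each a[href] next to the link's inner text
  //    (the "text then URL" ordering is an assumption).
  text = text.replace(
    /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi,
    (_match: string, href: string, inner: string) => `${inner} ${href}`,
  );

  // 3. Remove every remaining tag but keep its inner text.
  text = text.replace(/<[^>]+>/g, " ");

  // 4. Entities are deliberately left undecoded; only whitespace is tidied.
  return text.replace(/\s+/g, " ").trim();
}

const sample =
  '<head><title>Menu</title></head><p>Fish &amp; chips: <a href="https://example.com/menu">see the menu</a></p>';
const extracted = extractSignificantText(sample);
// extracted === "Fish &amp; chips: see the menu https://example.com/menu"
```

Regular expressions are used here only to keep the sketch short; a real HTML stripper has to handle malformed markup, comments, and nested elements, which is why the reader delegates to a dedicated library.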
Extends
Constructors
new HTMLReader()
new HTMLReader(): HTMLReader
Returns
Inherited from
Methods
getOptions()
getOptions(): object
Wrapper for the configuration options passed to the string-strip-html library.
Returns
object
An object of options for the underlying library
skipHtmlDecoding
skipHtmlDecoding: boolean = true
stripTogetherWithTheirContents
stripTogetherWithTheirContents: string[]
See
https://codsen.com/os/string-strip-html/examples
Defined in
packages/llamaindex/src/readers/HTMLReader.ts:41
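As a rough sketch of what such an options object might look like: the two field names come from the listing above, while the `StripOptions` type, the helper names, and the exact contents of the tag array (inferred from the class description, not confirmed by this page) are assumptions.

```typescript
// Hypothetical type describing the two documented option fields.
interface StripOptions {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
}

// Assumed defaults: the tag list is inferred from the class description
// (head, script, style, xml), not taken from the reader's source.
function defaultStripOptions(): StripOptions {
  return {
    skipHtmlDecoding: true,
    stripTogetherWithTheirContents: ["head", "script", "style", "xml"],
  };
}

// Callers can layer their own settings on top of the defaults.
function withOverrides(overrides: Partial<StripOptions>): StripOptions {
  return { ...defaultStripOptions(), ...overrides };
}

// Example: decode entities after all, keeping the default tag list.
const customOptions = withOverrides({ skipHtmlDecoding: false });
```

Spreading the overrides last means any field the caller supplies replaces the default wholesale, including the tag array, so a caller who wants to add one tag has to repeat the defaults.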
loadData()
Parameters
• filePath: string
Returns
Inherited from
Defined in
packages/core/schema/dist/index.d.ts:187
loadDataAsContent()
loadDataAsContent(fileContent): Promise<Document<Metadata>[]>
Public method for this reader. Required by the BaseReader interface.
Parameters
• fileContent: Uint8Array
The content of the file.
Returns
Promise<Document[]>
A Promise that eventually yields zero or one Document parsed from the HTML content of the specified file.
Overrides
Defined in
packages/llamaindex/src/readers/HTMLReader.ts:16
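The flow described above (raw bytes in, at most one document out) might look roughly like the following. Everything here is a stand-in: `SketchDocument`, `decodeHtmlSketch`, `parseContentSketch`, and `loadDataAsContentSketch` are hypothetical names, the UTF-8 decoding is an assumption about the file encoding, and returning an empty array for empty text is only one possible reading of "zero or one Document".

```typescript
// Hypothetical minimal document shape, for illustration only.
interface SketchDocument {
  text: string;
  metadata: Record<string, unknown>;
}

// Decode the raw file bytes as UTF-8 (an assumption about the encoding).
function decodeHtmlSketch(fileContent: Uint8Array): string {
  return new TextDecoder("utf-8").decode(fileContent);
}

// Stand-in for parseContent(); the real reader delegates to string-strip-html.
async function parseContentSketch(html: string): Promise<string> {
  return html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

async function loadDataAsContentSketch(
  fileContent: Uint8Array,
): Promise<SketchDocument[]> {
  const html = decodeHtmlSketch(fileContent);
  const text = await parseContentSketch(html);
  // One document per file; yielding none for empty text is an assumption.
  return text.length > 0 ? [{ text, metadata: {} }] : [];
}
```

A caller would typically read the file into a `Uint8Array` first and then await the returned promise to obtain the document array.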
parseContent()
parseContent(html, options): Promise<string>
Wrapper for string-strip-html usage.
Parameters
• html: string
Raw HTML content to be parsed.
• options: any = {}
An object of options for the underlying library
Returns
Promise<string>
The HTML content, stripped of unwanted tags and attributes.
See
getOptions
Defined in
packages/llamaindex/src/readers/HTMLReader.ts:31
addMetaData()
static addMetaData(filePath): (doc, index) => void
Parameters
• filePath: string
Returns
Function
Parameters
• index: number
Returns
void
Inherited from
Defined in
packages/core/schema/dist/index.d.ts:188
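The signature above describes a function that takes a file path and returns a `(doc, index) => void` callback. A minimal sketch of that pattern follows; the `SketchDocument` type, the `addMetaDataSketch` name, and the `file_path` metadata key are assumptions for illustration, since this page does not show what the inherited method actually writes.

```typescript
// Hypothetical document shape; the real type comes from the core schema package.
interface SketchDocument {
  text: string;
  metadata: Record<string, unknown>;
}

// Returns a callback with the (doc, index) => void shape documented above.
// The metadata key used here is an assumption chosen for illustration.
function addMetaDataSketch(
  filePath: string,
): (doc: SketchDocument, index: number) => void {
  return (doc, _index) => {
    doc.metadata["file_path"] = filePath;
  };
}

// The returned callback fits directly into Array.prototype.forEach.
const loadedDocs: SketchDocument[] = [{ text: "Hello", metadata: {} }];
loadedDocs.forEach(addMetaDataSketch("pages/index.html"));
```

Returning a closure lets the file path be bound once and then applied to every document produced from that file.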