<li>By default, it is named <code>{root_directory.name}_index.csv</code>.</li>
1909
-
<li>You can customize the filename or provide an absolute path for more control. </li>
1910
-
<li>This is something that the <code>save()</code> method in the <code>AbstractBaseWriter</code> should <strong>optionally</strong>
1911
-
implement, or let users decide to include the index by calling <code>add_to_index(path)</code> after <code>save()</code>.</li>
1908
+
<li>The <code>AbstractBaseWriter</code> now delegates all index operations to the <code>IndexWriter</code> class</li>
1909
+
<li>By default, the index file is named <code>{root_directory.name}_index.csv</code></li>
1910
+
<li>You can customize the filename or provide an absolute path for more control</li>
1911
+
<li>When implementing a writer class, call <code>add_to_index(path)</code> in your <code>save()</code> method to record saved files</li>
1912
1912
</ul>
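<p>The naming rule above can be sketched as follows. This is an illustration only: <code>resolve_index_path</code> is a hypothetical helper, not part of the library, and placing a relative filename inside the root directory is an assumption.</p>

```python
from pathlib import Path
from typing import Optional


def resolve_index_path(root_directory: Path, index_filename: Optional[str] = None) -> Path:
    """Hypothetical helper: choose where a writer's index file lives."""
    if index_filename is None:
        # Default described above: {root_directory.name}_index.csv
        return root_directory / f"{root_directory.name}_index.csv"
    candidate = Path(index_filename)
    # An absolute path is used as given; a bare filename is assumed
    # (not confirmed by the source) to be placed in the root directory.
    return candidate if candidate.is_absolute() else root_directory / candidate
```

<p>Under these assumptions, a root directory named <code>output</code> with no <code>index_filename</code> resolves to <code>output/output_index.csv</code>.</p>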
1913
1913
<p><strong>Key Features</strong>:</p>
1914
1914
<ul>
1915
1915
<li><strong>Customizable Filename</strong>: Use <code>index_filename</code> to set a custom name or absolute path.</li>
1916
-
<li><strong>Absolute/Relative Paths</strong>: Control file paths in the index with <code>absolute_paths_in_index</code>.</li>
1917
-
<li><strong>Inter-process Locking</strong>: Prevents conflicts in concurrent writing environments.</li>
1916
+
<li><strong>Absolute/Relative Paths</strong>: Control file paths in the index with <code>absolute_paths_in_index</code> (defaults to relative).</li>
1917
+
<li><strong>Schema Evolution</strong>: Control schema evolution with the <code>merge_columns</code> parameter when calling <code>add_to_index()</code>.</li>
1918
+
<li><strong>Safe Concurrent Access</strong>: Uses inter-process locking so that multiple processes can update the index without corrupting it.</li>
1919
+
<li><strong>Robust Error Handling</strong>: Specific exceptions for index-related errors to help troubleshoot issues.</li>
1920
+
</ul>
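<p>To make the schema evolution feature concrete, here is a minimal standard-library sketch of how a <code>merge_columns</code> flag could behave on a CSV index. It is not the <code>IndexWriter</code> implementation: <code>add_row</code> is a hypothetical function, the use of <code>ValueError</code> is an assumption, and inter-process locking is left out for brevity.</p>

```python
import csv
from pathlib import Path
from typing import Dict, List


def add_row(index_path: Path, row: Dict[str, str], merge_columns: bool = True) -> List[str]:
    """Hypothetical sketch: append one row to a CSV index, return the header."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    if index_path.exists():
        with index_path.open(newline="") as f:
            reader = csv.DictReader(f)
            header = list(reader.fieldnames or [])
            rows = list(reader)

    if not header:
        # New index: the first row defines the schema.
        header = list(row)
    elif merge_columns:
        # Schema evolution: unseen fields become new columns at the end.
        header += [col for col in row if col not in header]
    elif set(row) != set(header):
        # Strict mode: columns must match the existing index exactly.
        raise ValueError(f"Columns {sorted(row)} do not match index columns {sorted(header)}")

    rows.append(row)
    with index_path.open("w", newline="") as f:
        # restval fills cells for columns a given row does not have.
        writer = csv.DictWriter(f, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return header
```

<p>In this sketch, older rows get an empty cell for any column added later, which keeps an existing index file readable after the schema grows.</p>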
1921
+
<p><strong>Using the add_to_index Method</strong>:</p>
1922
+
<div class="highlight"><pre><span></span><code><a id="__codelineno-8-1" name="__codelineno-8-1" href="#__codelineno-8-1"></a><span class="c1"># In your writer's save method:</span>
<a id="__codelineno-8-7" name="__codelineno-8-7" href="#__codelineno-8-7"></a><span class="c1"># Record this file in the index, with optional parameters:</span>
<a id="__codelineno-8-10" name="__codelineno-8-10" href="#__codelineno-8-10"></a><span class="n">include_all_context</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># Include all context variables, not just those used in the filename</span>
1932
+
<a id="__codelineno-8-11" name="__codelineno-8-11" href="#__codelineno-8-11"></a><span class="n">filepath_column</span><span class="o">=</span><span class="s2">"path"</span><span class="p">,</span> <span class="c1"># Name of the column to store file paths</span>
1933
+
<a id="__codelineno-8-12" name="__codelineno-8-12" href="#__codelineno-8-12"></a><span class="n">replace_existing</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="c1"># Whether to replace existing entries for the same file</span>
1934
+
<a id="__codelineno-8-13" name="__codelineno-8-13" href="#__codelineno-8-13"></a><span class="n">merge_columns</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># Whether to allow schema evolution</span>
<p><strong>Schema Evolution with merge_columns</strong>:</p>
1940
+
<p>The <code>merge_columns</code> parameter (defaults to <code>True</code>) controls how the IndexWriter handles changes to your data schema:</p>
1941
+
<ul>
1942
+
<li><strong>When <code>True</code></strong>: If your context contains fields that are not yet columns in the index file, they are added as new columns. This is useful for:</li>
1943
+
<li>Iterative development when you're adding new metadata fields</li>
1944
+
<li>Different processes writing files with slightly different context variables</li>
1945
+
<li>
1946
+
<p>Ensuring backward compatibility with existing index files</p>
1947
+
</li>
1948
+
<li>
1949
+
<p><strong>When <code>False</code></strong>: Strict schema enforcement is applied. The IndexWriter will raise an error if the columns don't match exactly what's already in the index file. This is useful when:</p>
1950
+
</li>
1951
+
<li>You want to enforce a consistent schema across all entries</li>
1952
+
<li>You're concerned about typos or unintended fields creeping into your index</li>
1953
+
<li>Data consistency is critical for downstream processing</li>