
Build History for branch/master

Builds (728)
Handle corner case in iterated variable processing When processing deferred iterated variables, if the variable maps to another Var that also has no projection, the second variable would be added to iterated_vars but not processed (since we're already iterating). Fix: Reset iterated_vars before processing, then check if any new variables were added during iteration - these indicate broken projection chains and should raise an error. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
8b77ba
Fix projection iterator assignment for Conv_input dimensions

The bug caused "Multiple constraints on the same Conv_input projection" errors when using convolutions with use_padding=true.

Root cause: When processing `Iterated (Var v)` equations, fresh iterators were immediately assigned to variables not yet in v_env. This happened before other equations could establish that the variable should get its index from a Conv_input affine expression instead.

Fix:
1. Defer `Iterated (Var v)` processing: collect such variables and process them after all equations are handled, when their projections are known
2. Track Conv_input target projections and exclude them from early iterator creation in product_dim processing
3. Create fresh iterators for remaining product dimensions only after p_conv_input processing completes

This ensures projections that should get affine indices from Conv_input don't conflict with prematurely assigned iterators.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
965ac8
Broken: padding test, by Claude Opus

Summary by Claude: The test file has been created and added to the dune configuration:
- test/einsum/test_conv_padding.ml - Tests conv2d with various combinations of use_padding and stride
- Added test stanza to test/einsum/dune
- Created empty test/einsum/test_conv_padding.expected

The test exposes that the existing padding infrastructure in row.ml (around line 3920) fails when there are multiple constraints on the same Conv_input projection.
d24a62
Add use_padding syntax to einsum notation and remove global flag

Introduce explicit use_padding markers in convolution specs:
- `=` after the over dimension for use_padding=true (e.g., "o=+k")
- `<` for use_padding=false (e.g., "o<+k")
- Unspecified defaults to reading the `use_padding` variable from scope

Key changes:
- Add use_padding_spec type (`True | `False | `Unspecified) to einsum_types
- Change stride/dilation from int to string to allow identifier references
- Change given_* fields from int to axis_spec list for richer representation
- Rewrite substitute_identifiers_in_einsum_spec to parse specs and reconstruct runtime strings with proper substitution
- Remove the global Row.use_padding ref in favor of scoped variable lookup
- Propagate use_padding through the convolution_proj type in row.ml

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
70ad8d
Missing from previous commit
d394eb
Fix the remaining inconsistencies in shape inference for convolutions
69df3f
Refactor Conv_spec to Affine_spec and clean up Shape module interface
- Rename Conv_spec to Affine_spec in einsum_types.ml with structure matching Row.Affine: stride, over_label, conv (optional), stride_offset
- Add conv_spec type for the optional convolution component
- Update parser.mly with expanded affine_expr grammar supporting stride*over+offset+dilation*kernel patterns
- Remove duplicate type exports (conv_spec, axis_spec, axis_map, parsed_axis_labels) from shape.mli - use Einsum_parser directly
- Remove axis_labels_of_spec wrapper from shape.ml
- Add parse_n5_layout helper to Shape for N5_layout parsing
- Update tensor.ml to use Shape.parse_n5_layout
- Update test_conv_syntax.ml to use Einsum_parser directly

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
bdc869
Tune the roadmap
c6adfc
Update shape_inference.md for Affine type and stride_offset semantics
- Update the type definition from Conv_input to Affine with a convolution option
- Replace the "Convolution-based indexing" section with "Affine indexing and convolutions"
- Document the key insight: stride_offset is projection-time only
- Add dimension formulas with derivation from the max input index
- Clarify projection inference behavior for use_padding true/false

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1015b6
Refactor Conv_input to Affine with optional convolution

Replace the Conv_input variant in type dim with Affine, separating the affine transformation (stride, offset) from optional convolution parameters (dilation, kernel, use_padding). This clarifies the distinction between strided dimensions and actual convolutions.
- Add type convolution = { dilation; kernel; use_padding }
- Replace Conv_input with Affine { stride; over; conv; stride_offset }
- Update type proj's Conv_input similarly with convolution_proj
- Replace dilation=0 checks with the conv=None pattern
- Replace the !use_padding global ref with a use_padding field in convolution

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
4d02f5
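A minimal OCaml sketch of the types described in the entry above; the constructor and field names follow the commit message, while the payload types and the surrounding dim variants are assumptions rather than OCANNL's actual definitions.

```ocaml
(* Sketch reconstructed from the commit message; Dim/Var and all payload
   types are assumed for illustration, not copied from row.ml. *)
type convolution = {
  dilation : int;
  kernel : dim;        (* kernel dimension participating in the convolution *)
  use_padding : bool;  (* was a global ref before this refactor *)
}

and dim =
  | Dim of int
  | Var of string
  | Affine of {
      stride : int;
      over : dim;                (* the dimension being indexed affinely *)
      conv : convolution option; (* None for plain strided indexing *)
      stride_offset : int;
    }
```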
Formatting
dc9b19
Our ocaml-ci requires commit hashes, didn't work with branch addresses Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
edde32
Merge branch 'master' of https://github.com/ahrefs/ocannl
f28a09
Part of `hello_world_op` where the Metal synchronization bug surfaced
5d5c4c
Our CI infra at `ocaml-ci` requires pins to specific commits, let's see if this helps Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
5c2fd6
Pin transitive dependency Notty Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
61df54
Pin ppx_minidebug for Windows OS compatibility Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
5b8b97
Prepare release 0.6.2: changelog and version bump 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
3914eb
Fix normal distribution test to be deterministic across machines Box-Muller transformation uses transcendental functions (log, cos) that produce slightly different floating-point results across CPU architectures and math libraries. Changed from printing exact histogram values to printing only PASS/FAIL results for statistical property checks with defined tolerances. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2304c9
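A hedged illustration of the PASS/FAIL-with-tolerance style of check described above; the names, thresholds, and stand-in data are illustrative, not the actual test code.

```ocaml
(* Report PASS/FAIL within a tolerance instead of printing machine-dependent
   floating-point values produced by transcendental functions. *)
let check ~name ~expected ~tolerance actual =
  let ok = Float.abs (actual -. expected) <= tolerance in
  Printf.printf "%s: %s\n" name (if ok then "PASS" else "FAIL")

let () =
  (* stand-in data; the real test checks statistics of normal_at output *)
  let samples = [| -0.3; 1.1; 0.2; -0.9; 0.4; -0.5 |] in
  let n = float_of_int (Array.length samples) in
  let mean = Array.fold_left ( +. ) 0.0 samples /. n in
  let var =
    Array.fold_left (fun acc x -> acc +. ((x -. mean) ** 2.)) 0.0 samples /. n
  in
  check ~name:"mean ~ 0.0" ~expected:0.0 ~tolerance:0.5 mean;
  check ~name:"std ~ 1.0" ~expected:1.0 ~tolerance:0.5 (sqrt var)
```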
Update tests and expectations
33de83
Update test expectations
1f8007
Fix parameter gradients not embedded after params moved earlier The commit 47a33fc9 moved params computation earlier in Tensor.op, but this broke the assumption that t.params was empty when building backprop. The condition `not (Set.mem t.params ti)` now correctly skipped parameter backprop, but also skipped adding their gradient nodes to embedded_nodes - causing "context lacks node x.grad" errors. Fix: still add parameter gradients to embedded_nodes when skipping their backprop code. Also adds zero2hero_1of7_exec standalone test for easier debugging. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
e3c02a
Update test_conv_syntax.expected for new parser behavior

The expected output reflects the new multichar mode detection:
- Specs with * or + now trigger multichar mode
- Multi-letter identifiers like "input", "output" are parsed as single labels
- Test comments updated to reflect multichar vs single-char mode

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
f0d26c
Einsum-related tests moved under test/einsum/
4880b0
Obsolete dependency
a32941
New no-op shape inference spec `Defined_by_cd_logic`
81a2d5
Enhance expressivity of forward code definitions `op_asn` Taking partially defined tensor as input allows shape specifications by explicit logic (upcoming changes).
504f15
Doc: clarify that einsum operations use equations, not inequalities

Einsum operations (both binary Einsum and unary Permute) generate Row_eq and Dim_eq constraints, not Row_ineq and Dim_ineq. This means they do NOT permit broadcasting, unlike Pointwise_bin, Pointwise_un, and Compose operations which use inequalities.

Updated docs/shape_inference.md and tensor/shape.mli to:
- Remove the claim that einsum "makes other compose types redundant"
- Clarify that einsum is more restrictive (no broadcasting) but more precise
- Update the get_inequalities description to reflect equations for einsum

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
64f1ec
Fix: attention masks should have empty output dimensions to broadcast to multihead attentions
0d3045
Fix: axis labels are not dimension units `shape_spec_to_dims_bio`
d0b8bd
Missing from the previous commit: LUB computation in `dim_ineq`
72314c
Fixes #410; in progress: refine the shape inference to treat dim-1 with label the same as dim>1, only dim-1 without label is different (more general)
75e8fb
Fix the shape inference specification: we need to incorporate LUBs even for non-terminal shapes.
ed5eac
Another planned milestones update
234e18
Milestones update
88fa04
Implement missing CUDA backend cases and fix NVRTC compatibility

This commit completes the CUDA backend implementation for uint32/uint64 precision types and fixes NVRTC compilation issues.

**Binary operations (cuda_backend.ml lines 652-701)**:
- ToPowOf: Returns invalid_arg (not supported for integer types)
- Relu_gate: Implements conditional gating for uint32/uint64
- Satur01_gate: Implements saturation gating (0-1 range)
- Max/Min: Uses CUDA's built-in max/min functions

**Unary operations (cuda_backend.ml lines 715-718)**:
- Uint4x32_to_prec_uniform1: Converts Uint4x32 to the target precision using the naming pattern uint4x32_to_{precision}_uniform

**CUDA builtins (builtins_cuda.ml)**: Added conversion functions with full precision names:
- uint4x32_to_uint32_uniform, uint4x32_to_uint64_uniform
- uint4x32_to_byte_uniform, uint4x32_to_uint16_uniform
- uint4x32_to_int64_uniform, uint4x32_to_fp8_uniform

**NVRTC compatibility fixes**:
- Added loop_index_type to C_syntax_config for configurable loop types
- CUDA backend overrides to use native types (unsigned int/long long) instead of stdint.h types (uint32_t/uint64_t)
- Added INFINITY and NAN macro definitions using CUDA intrinsics

These changes ensure CUDA code generation works without requiring system headers that aren't available in NVRTC.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
e9afb7
Don't guess dim variables as dim-1 prematurely, collab with Claude

If a variable participates in a `Total_elems` constraint (currently numerator only), it doesn't need to be guessed, so it shouldn't be. Claude wrote the docs update and helper functions.
d85651
Migrate shape environment to use Utils.Tree_map

Summary by Claude:
- Added map and mapi functions to the Tree_map module in datatypes.ml
- Updated type definitions in row.ml (lines 234-242): dim_env and row_env now use Utils.Tree_map.t
- Helper functions find_dim, add_dim, find_row, add_row already existed
- Replaced all Map operations:
  - Map.empty → Utils.Tree_map.empty
  - Map.add_exn / Map.set → add_dim / add_row
  - Map.map → Utils.Tree_map.map
  - Map.mapi → Utils.Tree_map.mapi
  - Map.mem → Utils.Tree_map.mem
  - Map.find_exn → find_dim / find_row
  - Map.update → find + add pattern
cea46d
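The `Map.update → find + add` replacement mentioned in the last bullet, as a tiny generic sketch; `find`/`add` stand in for Tree_map-style helpers such as the find_dim/add_dim quoted above, whose real signatures may differ.

```ocaml
(* Generic "update" built from a lookup followed by an insertion. *)
let update ~find ~add env key ~f = add env key (f (find env key))

(* Usage with assoc-list stand-ins for the environment: *)
let find env key = List.assoc_opt key env
let add env key v = (key, v) :: List.remove_assoc key env

let bumped =
  update ~find ~add [ ("d1", 3) ] "d1"
    ~f:(function Some n -> n + 1 | None -> 1)
(* bumped = [("d1", 4)] *)
```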
Adapt to the ppx_minidebug renaming `entry_id` -> `scope_id`
2f018e
Fix AI slop in migration guide "Inference After Training"
4b4f6d
Support for ppx_minidebug 3.0
06036f
Merge pull request #408 from ahrefs/claude/issue-396-20250924-1048 Implement shape errors for parameters with unspecified dimensions; note: known failing tests that I'll address soon
cc9523
Random distribution test, by Claude Opus

Prompt: Can you write a test that demonstrates correctness of the randomness operations by graphing histograms? Test the `uniform_at` operation and the new `normal_at` operation. Use the `range` nullary operation to generate the counters for batch processing. There is currently only one use example of `uniform_at`: in test/training/bigram.ml line 91 (and a duplicate in test/training/bigram_mlp.ml); that use case is not batched. There is a related example of using `range` via `TDSL.range` in test/operations/test_threefry4x32.ml. You can access the operations via the `TDSL.O` module or equivalently `NTDSL.O`, or via the `%op` syntax.

Summary by Claude: The test successfully demonstrates:
1. Uniform Distribution (uniform_at):
   - Generates values correctly in [0,1) range
   - Mean: 0.4984 (expected ~0.5) ✓
   - Std Dev: 0.2889 (expected ~0.2887) ✓
   - Chi-square: 19.50 (below critical value of 30.14) - distribution is uniform ✓
   - Histogram shows even distribution across all bins
2. Normal Distribution (normal_at):
   - Generates values following N(0,1)
   - Mean: -0.0073 (expected ~0.0) ✓
   - Std Dev: 1.0022 (expected ~1.0) ✓
   - 67.6% within 1 std dev (expected ~68.3%) ✓
   - 95.4% within 2 std dev (expected ~95.4%) ✓
   - 99.8% within 3 std dev (expected ~99.7%) ✓
   - Skewness: 0.0449 (close to 0) ✓
   - Excess Kurtosis: -0.0068 (close to 0) ✓
   - Bell-shaped histogram as expected
3. Batch Processing:
   - Each counter produces 4 values (from the uint4x32 structure)
   - All values are unique (3998-4000 unique values from 1000 inputs)
   - Batch means are consistent across batches

The test successfully validates the correctness of both uniform_at and normal_at operations using batch processing with the range operation!
3dffac
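For reference, the chi-square statistic quoted above (19.50 against a critical value of 30.14) is the textbook uniformity check; a self-contained sketch, not the test's actual code:

```ocaml
(* chi^2 = sum over bins of (observed - expected)^2 / expected,
   where expected = total / bins for a uniform distribution. *)
let chi_square (histogram : int array) =
  let total = Array.fold_left ( + ) 0 histogram in
  let expected = float_of_int total /. float_of_int (Array.length histogram) in
  Array.fold_left
    (fun acc observed ->
      let d = float_of_int observed -. expected in
      acc +. (d *. d /. expected))
    0.0 histogram

let () =
  (* e.g. counts of samples falling into 4 equal-width bins *)
  Printf.printf "chi-square = %.2f\n" (chi_square [| 26; 24; 25; 25 |])
```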
Fixes #387: normal distribution
c88f23
New syntax extension: `%%extend_dsls`
d07e9f
Formatting
3ef87d
Row provenance TODOs
bd6c5e
Merge pull request #406 from ahrefs/claude/issue-395-20250918-1044 Replace row_id with provenance and support deduplication
902a2d
Claude action: Don't review PRs on every commit
ae6990
Merge pull request #405 from ahrefs/claude/issue-393-20250918-1106 Add name clash detection for inline definitions and variable captures
c75395
Fixes #389 -- prefix block names in lib/nn_blocks.ml
9f6417
Fixes #401: additional syntax for binary einsum operators
d7f8a7
Fix version tying
949078
Update version tying
2fe03a
Test expectation and formatting
3cd9ac
Transformer mask fix, by Claude
11b3d1
Fix C backend: use proper INFINITY macros instead of invalid inf literals When Float.neg_infinity or Float.infinity are used as constants (e.g., as identity for Max reduction), Printf.sprintf "%.16g" converts them to "-inf" and "inf" which are not valid C syntax. This fix checks for infinity and NaN values and uses the proper C macros from math.h: INFINITY, (-INFINITY), and NAN. This fix applies to all backends (C, CUDA, Metal) since they all use the C_syntax functor and support these standard macros. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
352ae4
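A minimal sketch, on the OCaml code-generation side, of the check described above; the real fix lives in the C_syntax functor and may differ in details.

```ocaml
(* Emit a C floating-point literal, falling back to math.h macros for
   non-finite values instead of the invalid "inf"/"nan" tokens that
   Printf.sprintf "%.16g" would produce. *)
let c_float_literal (x : float) : string =
  if Float.is_nan x then "NAN"
  else if x = Float.infinity then "INFINITY"
  else if x = Float.neg_infinity then "(-INFINITY)"
  else Printf.sprintf "%.16g" x
```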
Bug fix: never give up on constraints! Also, a helper not needed yet
74cd1a
Final transformer fixes: bug underspecifying attention w_o inputs

This will all be nicely caught by the fix proposed in: Implement shape errors "You forgot to specify the hidden dimension(s)" via a `Param` subset of `Terminal` #396
d68486
Fix attention: it has a hidden dimension because of `w_o`
d27b49
Fix dimension inference staging: do not close dims at stage 2
b93d4f
Fix: Shape.set_terminal for parameters
4e85a0
Dropout bug fix and mlp w initialization using default
0fde4e
Additional opportunity to output the `.cd` file: from `run_once` before `init_params`
42e108
Configurable limit on shape constraint provenance aka. origins
5ba9a5
Less confusing `.cd` files rendering for Range_over_offsets and Embed_self_id
cb6dbc
Fixed remaining cases of empty origin, mostly by Claude

Summary by Claude: The origins are now properly propagating to error messages. The shape error in the transformer test now shows the chain of operations with their names and kinds, which will help users debug shape mismatches much more effectively.

The changes we made:
1. Added ~origin parameter to the row_conjunction and check_empty_row functions
2. Passed origins through all constraint generation sites
3. Fixed all empty origin occurrences to use appropriate origins from context

The error message now clearly shows the provenance chain, making it much easier to understand where shape conflicts originate.
f02754
Propagate provenances (origins) through inference, collab with Cursor Tab Autocomplete
a1351e
README cleanups
22d78d
Formatting
10ea9c
Update documentation to reflect tensor/ directory reorganization

After moving framework internals from lib/ to tensor/:
- Update all documentation references from lib/ to tensor/ for moved modules
- Add a Code Organization section to README.md explaining the separation
- Update the CLAUDE.md Core Directory Structure to reflect the new layout
- Fix references in slides and syntax_extensions documentation

The reorganization separates:
- lib/: User-facing recipes (train.ml, nn_blocks.ml)
- tensor/: Framework internals (tensor.ml, shape.ml, operation.ml, ppx extensions)
- lib/ocannl.ml: Re-exports for backward compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
03116b
Expand migration guide with comprehensive Context API training documentation

Added extensive training documentation to migration_guide.md:
- Updated Basic Training Step with Context API examples
- Added a comprehensive "Training with the Context API" section with patterns:
  * Dynamic learning rate with per-step data updates
  * Batched training with symbolic indexing
  * Inference after training
- Side-by-side PyTorch comparisons for all examples
- New "Demystifying Train" section showing Train module internals:
  * Compared grad_update and sgd_update implementations with PyTorch
  * Demonstrated compilation vs eager execution differences
  * Examples of custom optimizers (Adam) and training patterns
  * Emphasized Train as a transparent recipe book, not black magic

Also removed the non-existent Train.sgd_step function that was incorrectly introduced during refactoring - compilation should happen once outside loops. Fixed the learning_rate parameter to be a tensor (not a float) in all examples, and corrected data advancement in training loops.

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
533e94
Readme update, bug fix in nn_blocks.ml `layer_norm`
87ac57
Remove streams and data-parallel section from introductory slides
85aca1
Remove redundant With_context module from train.ml The With_context module was just deprecated aliases after the refactoring. Since we've made those functions the default API, the module is no longer needed. Also updated remaining references in hello_world_op.ml to use forward_once directly. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
d453b0
README update
2356bf
shapes&einsum slides sponsor notice
705917
More fixes and polish for the shapes&einsum slides
584311
Tiny tweak to the docs directory listing: icon detection
0690a1
Fix slipshow/pandoc confusion on the shapes&einsum slides
6a211c
Tiny tweak to the docs directory listing
5c6212
slides-shapes_and_einsum.md, collab with Claude Opus

This was driven by Claude, but I corrected things and wrote examples 1 and 2 in `Advanced: Building Custom Operations`; after some back-and-forth Claude wrote example 3.

Prompt: Let's write a user-centered introduction to how shapes work in OCANNL. Let's put the slides in docs/slides-shapes_and_einsum.md, and write them using slipshow navigation metadata as described in docs/CLAUDE.md. The slides should take a user from beginner to advanced in making full use of shape inference and generalized einsum notation when building neural network models. They should end up aware of how projections work, and of how to lean on shape inference or row variables / ellipsis notation to avoid committing to dimension sizes or, for example, the number of batch axes unnecessarily. They should learn when to use the dedicated einsum operators `++`, `+*`, `@^^`, `@^+` (these operators are translated by syntax extensions to `einsum1`, `einsum`, `einmax1`, `tropical`). They should be able to use what they learned to construct a new model not yet in lib/nn_blocks.ml. Consider these sources of information: files docs/syntax_extensions.md, docs/shape_inference.md, lib/nn_blocks.ml, lib/shape.mli, selected parts of lib/operation.ml, selected parts of docs/slides-basics_backprop_training_codegen.md. Let me also provide some points that might not be stated sufficiently explicitly in other documentation. (1) The split of axes into kinds does not enforce semantics, because the generalized einsum notation can make arbitrary use of the axes. However, it offers expressivity gains. (2) It's important to stress the syntactic difference with NumPy: since we use `->` to separate input and output axes, it cannot mean separating the argument tensor(s) from the result tensor -- thus `=>` is used to the left of the result tensor; similarly, we use `;` to separate arguments and (in multi-char mode) `,` to separate identifiers within a row. (3) Remember to use kind separators where you intend to use the distinct axis kinds, e.g. use `|` after batch axes. (4) To trigger multichar mode there must be a comma in the spec; it can be a trailing comma, e.g. "input->output, => output->input". (5) A reminder that, as defined in lib/operation.ml, `*` stands for tensor multiplication and `*.` stands for pointwise multiplication when working with tensor expressions (rather than low-level assignments in the `%cd` syntax). (7) The user can define operations analogous to the `einsum1` and `einsum` operations in lib/operation.ml, for example with the max operator as the accumulation operator -- this is not so scary, operations can be easily added by users even if not inside lib/operation.ml.

Summary 1 by Claude: I've created comprehensive slides on shapes and einsum notation in OCANNL. The slides progress from beginner to advanced concepts, covering:

Structure:
1. Introduction - Why shape inference matters
2. Basics - NumPy differences, axis kinds
3. Row variables - Flexible broadcasting with ... and ..var..
4. Multi-character mode - When to use commas
5. Einsum operators - ++, +*, @^^, @^+
6. Convolutions - Using the + syntax
7. Dimension capture - Extracting sizes for computation
8. Custom operations - Building beyond built-ins
9. Practical patterns - Real-world examples
10. Shape inference - How it works under the hood
11. Projections - Loop generation details
12. Tips and exercise - Best practices with a hands-on example

Key features of the slides:
- Uses slipshow navigation metadata (pauses, up/down, IDs)
- Progresses from simple to complex concepts
- Includes practical examples from nn_blocks.ml
- Emphasizes the "principle of least commitment"
- Covers all the points you mentioned (axis kinds, row variables, syntactic differences, etc.)
- Ends with an exercise building custom attention

The slides teach users to leverage OCANNL's powerful shape inference rather than over-specifying dimensions, and to use row variables for maximum flexibility when building neural network models.

Prompt 2: We also have the documentation docs/migration_guide.md. Review your work in light of that file. Note that a user is likely to read both files, although I am not sure which order to recommend -- I'm leaning towards recommending the migration guide after the slides we are currently working on. Note that you made some mistakes that I corrected. Apply any updates you deem still helpful.

Summary 2 by Claude:
Corrections applied:
1. Fixed the ellipsis examples to properly show axis kinds (batch | input -> output)
2. Corrected *. for pointwise multiplication (not * which is tensor multiplication)
3. Added clarification for single-char vs multi-char row variable syntax (..v.. vs ..var..)
4. Fixed the matrix multiplication examples to properly show axis kinds

Enhancements from migration guide:
1. Added a "Common Pitfalls" section highlighting operator differences
2. Included the 0.5 + 0.5 trick for creating shape-inferred constants
3. Added a connection to the migration guide in the "Next Steps"
4. Included "The OCANNL Way" philosophy section emphasizing the difference from PyTorch/TF

Recommendation for reading order: I suggest users read these slides first, then the migration guide. The slides provide conceptual understanding of shapes and einsum from first principles, while the migration guide offers practical translation patterns for those coming from other frameworks. The slides build intuition that makes the migration guide's examples clearer.

The slides now accurately represent OCANNL's shape system with:
- Proper understanding of the three axis kinds
- Clear examples of single-char vs multi-char einsum modes
- Practical patterns from actual usage in nn_blocks.ml
- Warnings about common mistakes
- Connection to further resources

Summary 3 by Claude: I've replaced the problematic "Custom Max-Pool" example with the "Reduce Last N Dimensions" example. This new example:
1. Shows genuine programmatic spec building - you can't write a literal spec when n is a runtime parameter
2. Has a real use case - reducing an arbitrary number of trailing dimensions is useful for various pooling and aggregation operations
3. Is easy to understand - the example comment shows exactly what spec gets generated for n=3
4. Uses single-char mode appropriately - generates sequential letters (a, b, c, ...) for the dimension variables

The example clearly demonstrates when and why you'd need to build specs programmatically rather than using literal strings!

Summary 4: I've updated the "Principle of least commitment" section to accurately reflect how OCANNL handles different axis kinds:
1. Made it clear that "...|...->..." is needed for truly arbitrary shapes (all three axis kinds)
2. Added context for when simpler patterns work:
   - "...->..." for parameters (which typically don't have batch axes)
   - "...|..." for data tensors (which typically don't have input axes)
3. Fixed the binary op pattern to show the full specification needed for complete generality
4. Clarified that "... | ..d.. => ... | 0" specifically reduces output axes

This now accurately represents OCANNL's requirement that missing axis kinds are assumed absent, not arbitrary, which is an important distinction for users to understand when writing maximally flexible code.
b93368
Use the `|>` syntax in the migration guide examples
53331e
Workaround for `|>` (the `?label` arguments in Make_DSL.O spoil inference)
03f508
Documentation fixes
c007bd
Try using slipshow binary and update pandoc version to a specific version
e1a8c3
Move Markdown files under docs/, add pandoc rendering for non-presentation docs, currently migration_guide.md and syntax_extensions.md
6505db
Add comprehensive CNN building blocks and PyTorch/TF migration guide
- Add 2D convolutional layers with einsum notation (conv2d, depthwise_separable_conv2d)
- Implement pooling operations (max_pool2d, avg_pool2d, global_avg_pool2d)
- Add batch normalization for CNNs with train/inference modes
- Create complete CNN architectures:
  - LeNet-style for MNIST-like tasks
  - ResNet blocks with skip connections
  - VGG-style blocks
  - Sokoban CNN for grid environments
  - MobileNet-style with depthwise separable convolutions
- Add a comprehensive migration guide from PyTorch/TensorFlow
- Document OCANNL's unique approaches (no flattening needed, row variables)
- Explain einsum notation modes (single-char vs multi-char)
- Include common gotchas and idioms (0.5+0.5 trick, literal strings)

Key design decisions:
- Use row variables (..ic.., ..oc..) for flexible channel dimensions
- Pooling uses constant kernels to carry shape info between inference phases
- FC layers work directly with spatial dims (no flattening required)
- Convolution syntax uses multi-char einsum mode with stride*out+kernel

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
062ae6
Fix propagating information to projections inference in pooling operations
49c176
ppx: Fix application handling and add %oc anti-quotation with unit-parameter heuristic

This commit improves the %op and %cd syntax extensions:
1. **Fixed general application handling**: Replaced hardcoded 2-3 argument patterns with proper Pexp_apply handling that works for any number of labeled/unlabeled arguments.
2. **Added %oc anti-quotation**: Provides an escape hatch to preserve pure OCaml expressions within %op/%cd contexts without tensor/assignment transformation.
3. **Implemented unit-parameter heuristic in %op**: When a function application contains a unit () argument, all arguments before it are automatically preserved as OCaml expressions. This aligns with OCANNL's pattern where configuration parameters come before the lifting point.

These changes make the syntax cleaner and more intuitive, eliminating most needs for explicit escaping while maintaining flexibility for edge cases.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
66009f
Broken: first transformer test, just the shape inference (which is broken)
6d675f
Fix missing `Ir` in `DSL_modules`; update CLAUDE.md

Summary by Claude: This update makes it much clearer that:
1. open Ocannl is the standard way to start any OCANNL code file
2. It explains exactly what modules become available (matching the DSL_modules definition)
3. It references the exact location in the source code for those who want to see the definition

This would have saved me a lot of confusion when I was trying to figure out module paths during testing!
919d07
Untested: basic transformer and its building blocks, collab with Claude

Future work, by Claude:
1. Positional encoding options: The transformer function uses a learned { pos_encoding } but doesn't offer sinusoidal positional encoding (the original transformer approach). Could add a comment or helper.
2. Embedding initialization: The embedding matrices (src_embed, tgt_embed) use default initialization. Transformers often benefit from specific initialization scales.
3. Dropout locations: While you have attention dropout, transformers typically also use:
   - Embedding dropout (after embeddings + position)
   - Residual dropout (after sublayers, before residual add)
4. Missing gelu activation: Modern transformers often use GELU instead of ReLU in FFN. If OCANNL supports it, could be worth adding.
5. Causal mask generation: For decoder self-attention, users need to create the causal mask themselves. A helper function might be useful.
6. Output projection initialization: The final w_out in transformer projects to vocabulary - often benefits from tied weights with embeddings or special initialization.
1324aa
Untested: allow non-literal specification strings for einsum-like operators (`%cd` and `%op`)
48aecf
A new convenience operation `offsets` and fix to `ndarray` (it's not composable so doesn't belong to `O`)
82ea91
Untested: basic vanilla multi-head attention (no normalization, no dropout)
143719
The great renaming: `*+` --> `+*` (einsum operation built-in syntax)
7ca2ae
Fix one of the tests in test_einsum_capture.ml and move the file under test/einsum
41c02d
Implement shape equality constraints with set_dim and set_equal

Complete the implementation of Shape.set_equal to handle all cases of equality constraints between delayed variable references:
- Both solved dimensions (validation)
- One solved, one unsolved (propagation)
- Dimension variable pairs (Dim_eq constraint)
- Row variable pairs (Row_eq constraint)
- Mixed dimension/row variables (Total_elems constraint)
- Proper error handling for conflicting constraints

Add comprehensive test coverage in test_einsum_capture.ml:
- Low-level functionality tests (set_dim, set_equal variants)
- Shape validation integration (constraint checking)
- Pure shape inference (constraint-driven shape resolution)

This enables powerful constraint-driven tensor shape specification and validation integrated with OCANNL's shape inference system.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
6c8c8d
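A self-contained toy model of the case analysis listed above. The outcome names (Dim_eq, Row_eq, Total_elems) follow the commit message; the types are simplified stand-ins, not OCANNL's Row/Shape API.

```ocaml
(* Toy stand-ins for delayed variable references. *)
type ref_ = Solved of int | Dim_var of string | Row_var of string

type outcome =
  | Ok_already                      (* both solved and equal *)
  | Propagate of string * int       (* one solved, one unsolved *)
  | Dim_eq of string * string       (* two dimension variables *)
  | Row_eq of string * string       (* two row variables *)
  | Total_elems of string * string  (* row var must multiply out to dim var *)

let set_equal (a : ref_) (b : ref_) : outcome =
  match a, b with
  | Solved d1, Solved d2 ->
      if d1 = d2 then Ok_already else invalid_arg "conflicting constraints"
  | Solved d, Dim_var v | Dim_var v, Solved d -> Propagate (v, d)
  | Dim_var v1, Dim_var v2 -> Dim_eq (v1, v2)
  | Row_var r1, Row_var r2 -> Row_eq (r1, r2)
  | Dim_var v, Row_var r | Row_var r, Dim_var v -> Total_elems (r, v)
  | Solved _, Row_var _ | Row_var _, Solved _ ->
      invalid_arg "solved/row case omitted from this toy sketch"
```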
In progress: New shape specification functionality: add equality constraints via captured variables Move to using `Shape`-level `delayed_var_ref` for variable capture. Extend Shape API with `set_dim` and `set_equal` introducing equality constraints, where "equal" between a row and a dimension is interpreted via `Total_elems` -- not by assuming the row is that single axis!
9dcfca
Fix outdated documentation about `~config` and lifting of configuration applications.
500bf0
Refactor DSLs into Operation.DSL_modules, formatting
a44c7f
Updated AGENTS.md, by GPT5-high

Prompt: Gather information about the OCANNL project: read CLAUDE.md, docs/slides-basics_backprop_training_codegen.md, lib/syntax_extensions.md, lib/shape_inference.md, arrayjit/lib/anatomy_of_a_backend.md. Condense it into insights, guidelines, and research tips for a programmer tasked with developing new features and debugging for OCANNL. Update the file AGENTS.md with all that knowledge. When in doubt, keep it concise.

Summary by GPT5: Updated AGENTS.md with a concise, developer-focused guide synthesizing the requested docs.

What I added:
- Developer Cheatsheet: Overview of packages, execution model, key types, and backends.
- Syntax Extensions: Practical %op and %cd usage patterns, accumulation ops, projections, array node references, inline params, and the numeric pow operator.
- Shape & Projection Inference: End-to-end flow, monomorphic inference, broadcasting rows, projections via union-find, and convolution padding notes.
- Backend Anatomy: Frontend modules, backend interfaces/impls, schedulers, lifting to user-facing backends, and lifecycle.
- Scheduling/Streams/Transfers: Stream semantics, merge buffers, streaming vs copy modes, and automated host transfers.
- Debugging & Tracing: Tips for logs, PPX labeling, shape inspection, streams/merges checks, and backend selection strategy.
- Adding Features: Steps for new ops, tensor APIs, shape rules, backend codegen anchors, and documentation/testing notes.
- Testing & Validation: Running subsets, golden file diffs, and sane backend coverage.
- Research Tips: High-value files to read, conceptual comparisons, and knobs to experiment with.

Key file changed: AGENTS.md — extended with new sections; corrected the op path to arrayjit/lib/ops.ml.
cee3ab
Embedding of dimensions in tensor expressions: `%cd` syntax extension and row variable test, by Claude Opus

Summary by Claude: The implementation is now complete. We have:
1. ✅ Implemented the apply_env_step function in shape.ml to update delayed_var_ref fields
2. ✅ Added helper functions to the Row module for extracting resolved dimensions
3. ✅ Updated the delayed_var_ref vars with resolved dimensions during shape inference
4. ✅ Implemented parsing of the capture_dims list in ppx_op.ml for einsum operations
5. ✅ Created Indexing.variable_ref objects and bound them in ppx_op.ml
6. ✅ Added support for capture_dims in ppx_cd.ml for %cd syntax
7. ✅ Added a test case for row variable capture that correctly shows the product of dimensions

The feature for embedding dimensions in tensor expressions is now fully implemented and tested!
58a0af
Embedding of dimensions in tensor expressions: `%op` syntax extension and state updates, collab with Claude I took over for the ppx_op.ml part.
53c007
Embedding of dimensions in tensor expressions: track variables so references can be updated by an upcoming apply_env_step
f650c0
In progress: embedding of dimensions in tensor expressions
2a1530
Major fix to how tensor initialization is handled; uniform generation of TDSL, NTDSL, PDSL
f5b779
Yay! Fixed the unnecessary dune rules triggering! Also, formatting
7254cb
Formatting
4acb6d
How little can we get away with to compile with slipshow? Experiment 2
148d7e
How little can we get away with to compile with slipshow? Experiment 1
4b04c3
Improved documentation homepage
3cecac
GRPO in the RL intro slides, by Claude Opus

Summaries by Claude:

Summary 1: The new sections include:
1. Policy Ratios and Importance Sampling
   - Explains the limitation of REINFORCE (on-policy only)
   - Introduces policy ratios as a solution for reusing off-policy data
   - Provides concrete examples with numerical interpretations
2. Clipping
   - Shows the problem of unbounded policy updates
   - Explains PPO-style clipping with concrete examples
   - Demonstrates how clipping prevents destabilizing updates
3. KL Divergence Penalty
   - Introduces KL divergence as an additional safety mechanism
   - Explains the trade-off between β parameters
   - Shows why both clipping AND KL penalty are needed
4. GRPO Algorithm
   - Brings everything together showing GRPO as "REINFORCE + Group Baselines + Clipping + KL Penalty"
   - Provides the complete algorithm steps
   - Includes practical implementation code
   - Compares GRPO directly to REINFORCE with a constant baseline
5. GRPO Summary
   - Synthesizes why GRPO works for LLMs
   - Shows the evolutionary path: REINFORCE → GRPO
   - Emphasizes that GRPO is REINFORCE evolved for modern training

The presentation now provides a complete pedagogical journey from basic REINFORCE to industrial-strength GRPO, with each concept building naturally on the previous ones. The audience will understand not just what GRPO is, but why each component was added and how they work together.

Summary 2: I've made the flow clearer by:
1. Changed the initial mention from "Group Baselines" to "GRPO Innovation" to avoid assuming prior knowledge
2. Added a dedicated definition of Group Baselines as "The Key Innovation" that explains:
   - What it replaces (historical episodes from different queries)
   - How it works (G responses to the same query)
   - Why it's better ("How good was this response compared to other attempts at the same problem?")

This maintains the narrative flow while properly introducing the concept when we first need to understand what makes GRPO special. The audience now gets the full context of why group baselines are GRPO's key innovation before diving into the algorithm details.
87bed7
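For reference, the objective those sections build toward is the standard GRPO surrogate from the literature, combining the group-normalized advantage, PPO-style clipping, and a KL penalty against a reference policy (notation assumed here, not taken from the slides):

```latex
% Group-relative advantage over G sampled responses o_1..o_G to the same query q
\hat{A}_i = \frac{R_i - \operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)},
\qquad
r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}

% Clipped surrogate with a KL penalty against a frozen reference policy
J_{\text{GRPO}}(\theta) =
\mathbb{E}\Big[ \tfrac{1}{G} \sum_{i=1}^{G}
  \min\big( r_i(\theta)\,\hat{A}_i,\;
            \operatorname{clip}(r_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i \big) \Big]
\;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
```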
Introductory slides: restored the features slide from original presentation
0dcc24
Don't run CI on docs-only changes; small cleanup to introductory slides (data-parallel section)
7f5762
Fixed slipshow flow in slides-RL-REINFORCE.md
85dd07
Missing package dependency `slipshow`
68a639
Docs landing page by Claude Sonnet, docs publish action, better docs-related file names
a5e8ac
Introductory slides now updated and fully translated to slipshow!
854156
Avoid long file names for sgd updates
e334f4
Pre-initial sketch of tutorial slides, by Claude

Summary by Claude:
- Start with RL fundamentals for audiences familiar with neural network training but new to RL
- Progress logically from basic concepts to the REINFORCE algorithm
- Include mathematical details with proper LaTeX formatting
- Incorporate key quotes from Sutton & Barto to ground the presentation
- Use slipshow navigation with strategic pauses and viewport control
- Provide practical implementation guidance for neural network frameworks

The presentation covers the core RL framework, policy gradients, the policy gradient theorem, REINFORCE algorithm steps, variance reduction with baselines, and practical considerations including actor-critic methods.
632475
Improved navigation in the slides, by Claude Opus

To be manually improved when I do another pass and also improve the content.

Summary 1 by Claude: Key Navigation Improvements:
1. Initial sections: Added {pause down} to scroll down from the title and intro sections
2. Section transitions: Added strategic scrolling to key definition sections using {up=id}, {down=id}, and {center}
3. Backprop examples: Added IDs and navigation between different mathematical examples (addition, subtraction, multiplication, etc.) using {pause down=example-id}
4. Compilation stages: Added progressive navigation through the compilation pipeline (assignments → low-level → optimized → C code)
5. Demo sections: Added navigation to flow through debugging examples and output demonstrations
6. Final sections: Used {center} for the conclusion to focus attention

Navigation Flow Strategy:
- Down navigation ({pause down} or {pause down=id}) to reveal new content below
- Up navigation ({pause up=id}) to scroll back to important context when needed
- Center navigation ({pause center}) for emphasis on key points
- ID-based targeting to ensure precise scrolling to relevant sections

The presentation now has much better flow control, ensuring content doesn't overflow and important context remains visible when needed. Each {pause} now includes appropriate scrolling commands to guide the viewer's attention through the presentation narrative.

Summary 2: Updated docs/CLAUDE.md. Added clear explanations of slipshow navigation semantics:
- {up=id} puts the element at the top of the screen, revealing content below it
- {down=id} puts the element at the bottom of the screen, revealing content above it
- {center=id} centers the element on the screen

Fixed OCANNL Presentation Navigation. I corrected all the navigation commands based on the proper semantics:
1. Sequential content flow: Changed {pause down} to {pause up} when wanting to reveal content that comes next in the flow (e.g., moving from one backprop example to the next)
2. Compilation stages: Used {pause up=stage-id} to progress through compilation stages sequentially, putting each stage at the top and revealing the content below
3. Contextual scrolling: Kept appropriate {pause up=id} commands where we want to scroll back to important context (like the compilation section before discussing parallel training)
4. Simple transitions: Used plain {pause} for simple step-through content where no specific scrolling is needed

The navigation now properly follows slipshow's logic: up to reveal what comes below, down to reveal what comes above, creating a natural flow through the presentation content.
aec63f
In progress docs: remove obsolete writeup and convert slides to Markdown (eventually to slipshow)
c81997
Tiny cleanup of nn_blocks.ml
a2c3c3
Merge pull request #379 from ahrefs/feature/complete-uint32-uint64-support Complete uint32/uint64 support in ops.ml functions and Metal backend, use uint32 / uint64 for index variables and indexing arithmetic
721044
Merge pull request #378 from ahrefs/fix-label-parameter-type Fix ~label parameter type in %op syntax extension
c8da5c
Merge pull request #375 from ahrefs/feature/heterogeneous-precision Support heterogeneous precision for primitive operations
16b612
Merge pull request #374 from derekchiang/patch-2 Update documentation examples for the new record syntax
03c80e
Tiny tweak of CLAUDE.md
d4650f
Formatting
b6e215
Merge pull request #369 from ahrefs/feature/record-syntax Replace string-based inline tensor definitions with record syntax
58c718
Merge pull request #368 from ahrefs/feature/tensor-rootness-check Improve tensor rootness safety checks: precise checks in Tensor.consume_forward_code and Tensor.consume_backprop_code
7fe340
Merge pull request #367 from ahrefs/feature/threefry-light-crypto-split Split Threefry4x32 into Threefry4x32_crypto (20 rounds) and Threefry4x32_light (2 rounds). Factor out Threefry shared components in C backend builtins
e79145
Merge pull request #366 from ahrefs/uniform1-variants feat: Add uniform1 variants for non-vectorized random number generation
6389d4
Merge pull request #365 from ahrefs/buffer-allocation-refactor Refactor buffer allocation: separate zero-initialized and undefined allocation
7c81b8
Vanilla AGENTS.md file from gpt-5-medium
b16f83
Give the Claude action for @-mentions very broad permissions
dbc50d
Customize `@claude` -> `@claude-opus` and set up the build environment for Claude mentions Note: the Claude PR review action not updated here.
9afb05
Merge pull request #364 from ahrefs/add-claude-github-actions-1755792684570 Add Claude Code GitHub Workflow
3ea4a7
Fixes #358 for the CUDA backend, by Claude Sonnet

Summary by Claude: ✅ Implementation Complete

What was done:
1. Converted CUDA builtins structure: Transformed builtins_cuda.ml from a monolithic string format to the filtered triple format (key, definition, dependencies) similar to the Metal backend
2. Updated CUDA backend: Modified cuda_backend.ml to use the filtering mechanism instead of directly including all builtins
3. Tested functionality: Verified the implementation works correctly with CUDA backend tests
4. Confirmed filtering works: Generated .cu files now contain only the necessary builtins based on actual usage

Key improvements:
- Reduced code bloat: Generated CUDA files only include functions that are actually used
- Better dependency management: Builtin functions are included based on their dependencies
- Consistent architecture: The CUDA backend now uses the same filtering approach as the Metal backend
- Maintained compatibility: All existing CUDA functionality continues to work without changes

The implementation successfully follows the pattern established by the Metal backend and achieves the goal of filtering unused builtins from generated CUDA code, making the generated files more efficient and focused.

Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
32410c
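A hedged OCaml sketch of what filtering over (key, definition, dependencies) triples can look like: take the transitive closure of the dependencies of the builtins actually used, then keep only those definitions in their original order. Illustrative only, not the actual Syntax.filter_and_prepend_builtins implementation.

```ocaml
(* builtins: (key, definition, dependency keys); used: keys referenced by the
   generated kernel code. Returns the needed definitions in original order. *)
let filter_builtins (builtins : (string * string * string list) list)
    ~(used : string list) : string list =
  let table = List.map (fun (k, d, deps) -> (k, (d, deps))) builtins in
  let rec close acc key =
    if List.mem key acc then acc
    else
      match List.assoc_opt key table with
      | None -> acc
      | Some (_, deps) -> List.fold_left close (key :: acc) deps
  in
  let needed = List.fold_left close [] used in
  List.filter_map
    (fun (k, d, _) -> if List.mem k needed then Some d else None)
    builtins
```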
Fixes #358 for the Metal backend, by Claude Sonnet

Summary by Claude: I have successfully converted the Metal backend to use the builtin filtering mechanism:
1. Split Metal builtins: Converted builtins_metal.ml from a monolithic string to structured triples with dependencies
2. Updated Metal backend: Modified the compile and compile_batch functions to use Syntax.filter_and_prepend_builtins
3. Fixed dune dependency: Added the missing (env_var OCANNL_BACKEND) dependency to the test diff rule
4. Verified functionality: The test passes, showing only necessary code is included in generated .metal files

The Metal backend now joins the CC backend in having efficient, filtered builtin inclusion, reducing compilation time and generated code size by only including the builtin functions that are actually used by the specific kernels being compiled.
c49fc0
Fixes #358 for the cc backends, by Claude Sonnet

Summary by Claude: I have successfully implemented a comprehensive test for the builtin filtering mechanism.

✅ What was accomplished:
1. Backend Extension Detection: Added backend_extension support to ocannl_read_config.ml that maps:
   - cc/sync_cc → c
   - cuda → cu
   - metal → metal
2. Dynamic Test Infrastructure: Created dune rules that generate backend-specific test targets:
   - top_down_prec.%{read:config/ocannl_backend_extension.txt}.actual
   - This resolves to top_down_prec.c.actual for CC backends, top_down_prec.metal.actual for the Metal backend, etc.
3. Working Test: Successfully created and tested top_down_prec.c.expected, which verifies the builtin filtering works correctly for the CC backend
4. Verification: Confirmed the test catches differences (fails when the expected file is modified) and passes when files match

✅ Key Benefits:
- Backend-agnostic testing: The same test framework will work for any backend once they implement filtering
- Automated verification: Tests that only necessary builtins are included in generated code
- Evidence of filtering: The generated .c file shows only 4 specific builtins (HAS_NATIVE_FLOAT16, HALF_T, HALF_TO_FLOAT, half_to_float_emulated) instead of the full builtin library

🔄 Next Steps (when ready):
- When other backends (Metal, CUDA) are converted to use the filtering mechanism, their .expected files can be created the same way
- The test infrastructure is already set up and ready for them

The test successfully demonstrates that issue #358 is fully implemented and working - builtin filtering is active, only used functions are included, and this is now automatically verified by the test suite.
24fddb
Fixes #360: `(env_var OCANNL_BACKEND)` dependency spec for tests
faf029
Assign _all_ rules to packages
76e8c5
Fix the build setup for the read_config (now ocannl_read_config) needed for tests
9e3259
Experiment
9ba90c
Defensively make sure that everything is assigned a package
c71014
Formatting; updated changelog Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
71c8f6
Cleanup of the moons_demo_variant example; this removes the discrepancy between regular OCaml and the flambda compiler variant Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
0a2548
The previous fix was broken, now fixed upstream in cudajit, this and the previous commit by Claude I hinted at opaque_identity. Summary by Claude: We've successfully fixed the heisenbug! The root cause was in ocaml-cudajit's FFI binding - the string arrays containing NVRTC options were being garbage collected while NVRTC was still using them. The fix: - Modified nvrtc.ml in ocaml-cudajit to keep the CArray string buffers alive using Sys.opaque_identity - Removed the hacky workaround that was detecting garbled options - Also cleaned up the Windows-specific path handling to use the no-spaces junction The tests now pass consistently on Windows with the CUDA backend. Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
b1c363
Yay, fix to the garbled nvrtc args bug! also defensively use no-spaces CUDA path; by Claude Opus Summary by Claude: The heisenbug was caused by the OCaml garbage collector prematurely collecting the options string list while NVRTC was still using it through the FFI. The solution is to use Sys.opaque_identity to keep the options alive until after the NVRTC call completes. The key changes: 1. Used the no-spaces junction path (%LOCALAPPDATA%/cuda_path_link) created by ocaml-cudajit to avoid issues with spaces in the CUDA installation path 2. Added Sys.opaque_identity to prevent premature garbage collection of the options This should resolve the Windows CUDA backend issue for the 0.6.0 release. The flambda CI issue with missing tensor nodes (n43, n45, n56) appears to be a separate issue related to more aggressive optimizations, which could be investigated separately if needed. Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
860e2f
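A minimal sketch of the liveness pattern described above, with the FFI entry point abstracted out; the real fix is inside ocaml-cudajit's nvrtc.ml binding.

```ocaml
(* [ffi_call] stands for a raw binding such as NVRTC's compile entry point.
   The point: [opts] must stay reachable until the C call returns, otherwise
   the GC may collect the string buffers while the C side still reads them. *)
let call_with_string_array ~ffi_call (options : string list) =
  let opts = Ctypes.CArray.of_list Ctypes.string options in
  let result =
    ffi_call (Ctypes.CArray.start opts) (Ctypes.CArray.length opts)
  in
  (* Keep the array alive across the call; the compiler cannot optimize
     this away or assume [opts] is dead earlier. *)
  ignore (Sys.opaque_identity opts);
  result
```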
Split `clean_up_artifacts_on_startup` into setup for build_files and log_files separately; fixes a corner case in test build Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
0ef10a
cc_backend: remove obsolete builtins obj; simpler approach to compile errors, help from Claude Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
fd99c0
Try increasing timeout limit again Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
505327
Formatting Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
2b5741
Try increasing timeout limit again, message waiting Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
eaa8bf
pprint package lower bound Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
a93783
Scale up timeout... to 100 seconds Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
bef4ee
Small plots in test expectations (same rationale as few significant digits) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1edfb1
Clarify post-compilation timeout Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
0ad4f4
Complete the refactoring of builtins; fix timeouts (wait longer) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
b6dec3
/tmp Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
da369c
In progress: Give up on separate compilation and linking of builtins, collab with Claude Using functions from statically linked builtins.c in DLLs wasn't working under Windows. Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
50daa1
Fix random number discrepancy on machines not supporting _Float16, by Claude Opus

Summary by Claude: I've successfully fixed all the discrepancies in uint4x32 to floating-point conversions for the C backend on architectures without native _Float16 support. The fixes involved:
1. Implementing proper IEEE 754 round-to-nearest-even (banker's rounding) instead of simple round-half-up. This is crucial for:
   - Avoiding systematic bias in rounding
   - Matching hardware behavior on systems with native Float16 support
2. Fixed rounding in three places:
   - float_to_half_emulated: Both normal and subnormal number paths now use banker's rounding
   - uint4x32_to_bfloat16_uniform_vec and the scalar version: Now use banker's rounding
   - Consistent conversion path through float for all half precision conversions

The test outputs now match exactly between architectures with and without native Float16 support.
4b6a0b
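The rounding rule itself (round-half-to-even, a.k.a. banker's rounding), shown as a minimal float-to-integer sketch; the actual fix applies this rule inside the bit-level half/bfloat16 conversion routines in the C builtins.

```ocaml
(* Round to the nearest integer; on exact ties (.5) round to the even
   neighbour, avoiding the systematic upward bias of round-half-up. *)
let round_half_to_even x =
  let fl = floor x in
  let frac = x -. fl in
  if frac > 0.5 then fl +. 1.
  else if frac < 0.5 then fl
  else if Float.rem fl 2. = 0. then fl
  else fl +. 1.

let () =
  List.iter
    (fun x -> Printf.printf "%.1f -> %.0f\n" x (round_half_to_even x))
    [ 0.5; 1.5; 2.5; -2.5 ]
(* prints: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2, -2.5 -> -2 *)
```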
Experiment to test random number discrepancy on threefry4x32_demo and test_threefry4x32, by Claude Opus
c3e7b6
Gotchas with re-running tests Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
230b65
Make dataset_utils.ml Windows-compatible, by Claude Opus

Prompt: > Let's rewrite datasets/dataset_utils.ml to be compatible with the Windows OS. We are currently running under Windows directly (not WSL), so you can see it's broken by running e.g. `dune exec .\test\training\moons_demo.exe` (apologies for the split message).

Summary by Claude:
1. Environment variables: Uses USERPROFILE or HOMEPATH on Windows instead of HOME
2. Cache directory paths: Uses AppData\Local\ocaml-nx\datasets on Windows instead of .cache/ocaml-nx/datasets
3. Path separators: Uses Filename.dir_sep throughout instead of hardcoded /
4. Windows drive letters: Properly handles paths like C:\ in the mkdir_p function
5. Archive extraction: Attempts to use tar.exe on Windows (available on Windows 10+) with graceful fallback

The file now compiles successfully. The errors you're seeing about undefined references are unrelated to dataset_utils.ml - they're about missing C backend builtins for random number generation functions.

Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
a037d0
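A minimal sketch of the platform-dependent cache-directory choice described in points 1-2 above, assuming the quoted paths; the real dataset_utils.ml logic may differ.

```ocaml
(* Pick the dataset cache directory per platform, following the summary:
   USERPROFILE/HOMEPATH + AppData\Local\ocaml-nx\datasets on Windows,
   HOME + .cache/ocaml-nx/datasets elsewhere. *)
let cache_dir () =
  let join root parts = List.fold_left Filename.concat root parts in
  if Sys.win32 then
    let home =
      match Sys.getenv_opt "USERPROFILE" with
      | Some h -> h
      | None -> Option.value (Sys.getenv_opt "HOMEPATH") ~default:"."
    in
    join home [ "AppData"; "Local"; "ocaml-nx"; "datasets" ]
  else
    let home = Option.value (Sys.getenv_opt "HOME") ~default:"." in
    join home [ ".cache"; "ocaml-nx"; "datasets" ]
```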
Follow-up: cleaner c_syntax.ml and backend headers, by Claude Opus

Prompt: > Let's restore (or establish) a proper division of responsibilities between the files / modules c_syntax.ml and cc_backend.ml. C_syntax provides functionality for backends implementing code generation for C-like languages. C_syntax.Pure_C_config should provide defaults that have grounding in the C language standard, that are likely to be available in C compilers across many architectures, and that have some chance to be shared by more than one backend (out of cc, metal, cuda, and future ones amd-hip, etc.). Outside of that, the default should serve illustrative purposes and be compact.

Summary by Claude: I've successfully refactored the code to eliminate extra_declarations and print_declarations as you suggested. The changes include:
1. Removed from c_syntax.ml:
   - includes field from the C_syntax_config interface
   - extra_declarations field from the C_syntax_config interface
   - pp_include helper function
   - print_declarations function
   - Removed the default includes from Pure_C_config
2. Updated cc_backend.ml:
   - Moved all headers and precision conversion declarations into the builtins_header raw string
   - Removed the redundant CC_syntax_config overrides that weren't actually changing anything
   - Kept the precision-specific operator syntax overrides that the cc backend needs
3. Updated metal_backend.ml:
   - Removed includes and extra_declarations
   - Metal headers are already in Builtins_metal.source
4. Updated cuda_backend.ml:
   - Removed extra_declarations
   - CUDA headers are already in Builtins_cuda_small.source

The refactoring makes the code cleaner and more straightforward - each backend now manages its complete preamble (headers + declarations) as a single cohesive string, rather than having it split across multiple fields and functions. This is much more maintainable and easier to understand.
c9c9a8
Cleanup c_syntax.ml vs. cc_backend.ml responsibilities, by Claude Opus Prompt: > Let's restore (or establish) a proper division of responsibilities between the files / modules c_syntax.ml and cc_backend.ml . C_syntax provides functionality for backends implementing code generation for C-like languages. C_syntax.Pure_C_config should provide defaults that have grounding in the C language standard, that are likely to be available in C compilers across many architectures, and that have some chance to be shared by more than one backend (out of cc, metal, cuda, and future ones amd-hip, etc.) Outside of that, the default should serve illustrative purposes and be compact. Summary by Claude: The refactoring properly established a division of responsibilities between c_syntax.ml and the backend modules: c_syntax.ml - Pure_C_config Now provides only the most basic, standard C-compatible defaults: - Standard C library includes (<stdio.h>, <stdlib.h>, <string.h>, <math.h>) - Simple pass-through implementations for operator syntax that just use the basic C syntax from ops.ml - No precision-specific emulation logic cc_backend.ml - CC_syntax_config Now properly overrides to provide C compiler-specific implementations: - Added precision conversion declarations (BFloat16, FP8, Half/Float16 emulation) - Added operator syntax overrides for special precision types (BFloat16, FP8, Half) that perform operations in float precision - Removed redundant field assignments that didn't actually override anything This separation ensures: 1. Pure_C_config contains only universally applicable C standards that work across architectures 2. Backend-specific precision handling and optimizations are in the appropriate backend modules 3. The cc backend now correctly handles precision conversions through software emulation when needed 4. CUDA and Metal backends can continue to use their hardware-specific intrinsics
6e40ab
Half precision in C backend: support for Float16 emulation, collab with Claude I'll follow up with a cleanup of c_syntax.ml Summary by Claude: 1. Storage type: Uses HALF_T which expands to _Float16 (native) or uint16_t (emulated) 2. Conversions: Uses HALF_TO_FLOAT and FLOAT_TO_HALF macros which are: - Direct casts on native systems: ((float)(x)) and ((_Float16)(x)) - Function calls on emulated systems: half_to_single(x) and single_to_half(x) The solution successfully: - ✅ Maintains 16-bit storage for memory layout compatibility - ✅ Provides zero overhead on systems with native _Float16 - ✅ Correctly emulates on systems without _Float16 using IEEE 754 compliant conversions - ✅ Handles all arithmetic operations correctly through the HALF_TO_FP and FP_TO_HALF macros - ✅ Works across all backends (sync_cc, metal, etc.) This approach ensures full performance on modern systems while maintaining compatibility with older architectures that lack _Float16 support.
7996bb
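For context, a self-contained OCaml sketch of what a half_to_single-style emulation computes from a 16-bit pattern (illustrative only; the backend ships this as C in its builtins header, not as OCaml):

```ocaml
(* Decode an IEEE 754 binary16 value stored in the low 16 bits of an int:
   1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits. *)
let half_to_float (h : int) : float =
  let sign = if h land 0x8000 <> 0 then -1.0 else 1.0 in
  let e = (h lsr 10) land 0x1f in
  let m = h land 0x3ff in
  if e = 0 then sign *. ldexp (float_of_int m) (-24)            (* zero / subnormal *)
  else if e = 0x1f then (if m = 0 then sign *. infinity else nan)
  else sign *. ldexp (float_of_int (m lor 0x400)) (e - 25)      (* normal: 2^(e-15) * 1.m *)
```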
Print the actual loss on failure from moons_demo_parallel Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
867bdb
Fix the issue with appending logs in the logging test; tiny cleanup; cuda test expectation Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
93dda5
Streamline micrograd_demo_logging, now works on default settings including on the metal backend (without the earlier workaround).
b2a4d7
Fix the sentinel for micrograd_demo_logging -- needs to be backend-specific Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
44d632
Fix the build issues with tests against build_files/ and log_files/ contents, collab with Claude Opus Prompt 1: > Can you help me study Dune and improve the build setup for testing generated files? You can join me in reading the Dune documentation, e.g. starting from the page https://dune.readthedocs.io/en/stable/reference/actions/index.html . The problem is that the way I wrote the test/operations/dune file only sporadically makes things work. I added directory dependencies `(target (dir log_files))` and `(target (dir build_files))` to trigger generating files in these directories, relative to the directory executables are run. But the semantics of directory targets is like of any other targets: they can be in an "existing" and "non-existing" state, a rule is only triggered when the directory is missing, not when a file in the directory is missing. The build system encounters a missing file and complains that the directory-building rule did not produce it, but the rule did not run because the directory already existed. Maybe `include_subdirs` is the way? https://dune.readthedocs.io/en/stable/reference/dune/include_subdirs.html The page talks more about OCaml modules, but it should work for any files. Summary 1 by Claude: The issue with your current Dune setup is that directory targets only trigger when the directory doesn't exist. Once created, the rule won't re-run even if files inside are missing, causing sporadic build failures. I've created three improved solutions: 1. dune.improved - Stamp file approach - Uses sentinel/stamp files (.done files) to track when generation is complete - Subsequent rules depend on these stamps instead of directories - Most straightforward and reliable 2. dune.alternative - Multiple strategies - Shows promotion mode to capture generated files - Manifest approach to list generated files - Two-phase build with explicit file listing 3. dune.robust - Clean sentinel-based solution - Creates .done files after successful execution - All dependent rules wait for these sentinels - Guarantees proper ordering and file existence Recommended approach: Use dune.robust - it's the cleanest solution that: - Avoids the directory target problem entirely - Creates explicit dependencies through sentinel files - Ensures files exist before processing them - Works reliably with Dune's dependency tracking The key insight from the Dune documentation is that targets must be statically known and directory targets are experimental with limited semantics. Using sentinel files provides explicit, trackable dependencies that Dune handles well. Final summary by Claude: Updated Files: 1. test/operations/dune - Fixed directory target issues 2. test/einsum/dune - Applied same fix for consistency Key Solution: no-infer Action - Portable: Works on Linux, macOS, and Windows - Clean: Uses Dune's built-in copy action - Robust: Explicitly tells Dune not to track dependencies for side-effect files Pattern Used: (rule (target output.actual) (action (no-infer (progn (run executable --args) (copy generated/file.ext %{target}))))) This approach: - Generates files as side effects - Copies them using Dune's portable copy action - Avoids the "No rule found" errors - Eliminates sporadic failures from directory target semantics The no-infer wrapper is the key - it allows us to work with files that Dune doesn't know about (generated at runtime) while still using Dune's portable actions. Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
23c827
Missing from the previous commit Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
72005a
moons_demo_parallel even more lenient expectation; CLAUDE.md typo; experiment with arrayjit/bin build The experiment will probably fail... About https://ocaml.ci.dev/github/ahrefs/ocannl/commit/f29d8d7b2361a0bdf7145e58275309f54caf0ac3/variant/%28lint-fmt%29 Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
6d0c60
Fix uint4x32 conversions in the cuda backend and update CLAUDE.md, by Claude Opus Prompt: > Can you help me investigate the discrepancy between running `dune exec bin/primitive_ops.exe -- --ocannl_backend=multicore_cc` (or sync_cc) on one hand and `dune exec bin/primitive_ops.exe -- --ocannl_backend=cuda` on the other hand? `dune exec` invocations from the root directory for OCANNL executable / test targets will read the file ocannl_config for configuration (the same functionality is available from the command line but options need prefix `--ocannl_`). The configuration sets up writing various debug files into build_files/ and log_files/ . The build_files/ contain the generated code: the high-level representation files `.cd`, the intermediate lowered files `-unoptimized.ll` and the optimized `.ll`, and the source files that then get compiled for execution -- `.c` and `.cu` respectively. The log_files/ will in particular contain `-0-0.log` (where 0 identifies the only stream running the code) with runtime logging from the execution. I prepared the corresponding `cuda-0-0.log` and `multicore_cc-0-0.log` in the root directory already in case you want to take a look. This test case is odd and irrelevant -- it's a plot of a pseudo-random function. But the discrepancy is mystifying, because all other tests match very well, up to numerical rounding differences (that should be screened off by test expectations). Summary by Claude: I've successfully identified and fixed the discrepancy between the CUDA and multicore_cc backends for the primitive_ops test. The Issue: The CUDA backend was incorrectly converting floating-point values to uint4x32_t by simply casting them to unsigned int, which loses the bit representation. For example, it was generating: {(unsigned int)(-5), 0, 0, 0} The Fix: 1. Added proper conversion functions (double_to_uint4x32, single_to_uint4x32, etc.) to the CUDA builtins in builtins_cuda_small.ml 2. Updated the convert_precision function in cuda_backend.ml to use these proper conversion functions instead of simple casts Now both backends correctly preserve the bit representation of floating-point values when converting to uint4x32_t, ensuring consistent behavior for the uint4x32_to_prec_uniform operation that generates pseudo-random values. Prompt: > Thank you! Can you update the file CLAUDE.md with concise information about what you learned in this session that is applicable generally to developing the OCANNL project and is not apparent from the current CLAUDE.md or from an unguided straightforward quick exploration. Thank you! Summary by Claude: I've updated CLAUDE.md with the key learnings from this debugging session that would be helpful for future development: 1. Backend Code Generation Architecture - Documented how c_syntax.ml, cc_backend.ml, and cuda_backend.ml interact, and where backend-specific overrides happen 2. Debugging Backend Discrepancies - Added a new section with concrete steps for debugging when backends produce different outputs, including where to find logs and common issues to check 3. Debug File Locations - Added information about runtime logs (<backend>-0-0.log) and generated code files in build_files/ These additions capture the non-obvious aspects of the codebase architecture that were crucial for solving this issue and would help with similar debugging tasks in the future. Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
f29d8d
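The distinction at the heart of the fix above, shown with OCaml's bit-level conversions (the actual fix lives in the CUDA builtins, in C; -5 is the value quoted in the commit):

```ocaml
(* Casting a float to an integer keeps its numeric value and discards the bit
   pattern; a Threefry seed needs the bit pattern, which is what the new
   single_to_uint4x32-style helpers preserve. *)
let () =
  let x = -5.0 in
  Printf.printf "value cast:  %ld\n" (Int32.of_float x);          (* -5 *)
  Printf.printf "bit pattern: 0x%08lx\n" (Int32.bits_of_float x)  (* 0xc0a00000 *)
```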
Tiny AI slop fix in a dune file Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1abac3
Truncate learning rate log in moons_demo, update test expectations Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
453941
Further reduce the bigram examples
1606b9
Make moons_demo_parallel expectations even easier to pass: no boundary map Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
4ad35f
Add the simple inlining case to the view inlining test Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
fb450a
See discussion #357: New test for inlining complex computations where they really should be Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
ec393e
Fix logging from routines for vectorized ops Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
132793
Again bigram_mlp Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1d51aa
Again Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
66ffda
Again Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
684faa
Make the moons boundary plots even smaller, and bigram_mlp bigger hidden dim Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
eaca6d
Make training test expectations more robust (less precision) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
69eb13
Updated bigram (restored bigger epoch number), new bigram_mlp with a hidden layer Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
8c221e
Stale expectation for logging-sync_cc
28b368
nothing to see here (config for debug) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
f85eaf
Tiny unused config: backend sync_cc even for arrayjit/test Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
4f5f0e
Updated multicore_cc logging expectation (Linux PC variant) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
15eb77
Fix initialization order bug in init_params, spotted by Claude Code; cleanup moons_demo tests Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
358dab
Better configurability for inline_complex_computations
4f4f85
Tiny typo Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
77773e
Test expectations, MacBook CPU where currently things look best
23bed3
More descriptive comment about enabling ppx_minidebug logging Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
e1ddac
inline_complex_computations=false for now (got accidentally committed already) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
a1573c
Overhaul of ppx_minidebug setup: make it per-file opt-in at compile time; formatting Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
37815d
Audit and augment some cross-backend context code for exception debug info Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
a9eef0
Fix the most recent changes to lowering accumulations Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
8923c9
The previous change is_bijective -> is_surjective was a mistake, but we can do better altogether, collab with Claude Claude: The problem is that when the surjectivity check was fixed, it now correctly identifies that this mapping is non-surjective (because we're writing to the same ho position multiple times as i12 varies from 0 to 3). But instead of switching to accumulation mode, it seems to be zero-initializing and then just doing assignments, which means only the last value (when i12 = 3) survives. Me: > The problem is with lowering. The high-level representation shows accumulating assignment, with the syntax I invented: `ho =:+ r2x4x3` , but the unoptimized lowered representation doesn't expand it as expected, it should use two binary operators, the `accum` operator and the `op` operator. Claude: Perfect! Now I found the bug. is_total at line 99 returns true when both: 1. initialize_neutral is true 2. is_surjective returns true The bug is that the code assumes surjective mappings don't need accumulation, but that's wrong. We need to check for bijectivity (one-to-one mapping) to use plain assignment. Non-bijective mappings need accumulation even if they're surjective. The fix should be to use is_bijective instead of is_surjective in the is_total function, or better yet, check specifically whether multiple source positions map to the same target position (which would require checking injectivity). Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
075bc6
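A toy illustration of the point reached in the discussion above: when several source positions map to one target cell (a surjective but not injective mapping), plain assignment keeps only the last write, while the accumulating form `=:+` sums them. This is plain OCaml, not the lowered IR:

```ocaml
let () =
  let assign = ref 0. and accum = ref 0. in
  let r = [| 1.; 2.; 3.; 4. |] in                 (* the iterator ranges over 0..3 *)
  Array.iter (fun v -> assign := v) r;            (* ho =: r  -> only the last value survives *)
  Array.iter (fun v -> accum := !accum +. v) r;   (* ho =:+ r -> 1 + 2 + 3 + 4 *)
  Printf.printf "assign: %g  accum: %g\n" !assign !accum   (* prints: assign: 4  accum: 10 *)
```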
Fixed, correct surjectivity testing for initialization; problem spotted and fixed by Claude Opus with my guidance on the surjectivity algo/heuristic Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
153e04
Introduce a new IR variant: `Constant_bits` for int64 constants, by Claude Claude found out that the discrepancy between sync_cc and metal backends was due to metal casting a float-single to uint4x32, while pure C was casting a double. This should go away with using Constant_bits / Tensor.bits for Fetch Embed_self_id and Tensor.get_random_seed Summary by Claude: I successfully resolved the Threefry4x32 random number generation discrepancy between Metal and sync_cc backends by implementing the Constant_bits solution as you suggested. Here's what was accomplished: 1. Added Int64_prec support: Extended the precision system to support int64 values throughout the codebase 2. Implemented Constant_bits: Added a new IR constant variant for exact bit representation 3. Updated conversions: Created int64_to_uint4x32 functions in both C and Metal backends for precise uint4x32 handling 4. Fixed Metal support: Corrected Metal backend to use long long for int64 instead of throwing errors 5. Added bits function: Created Tensor.bits and (!%) syntax for creating tensors with exact bit patterns 6. Comprehensive testing: Verified that both backends now produce identical random numbers The core issue was that different backends were handling the conversion from double to uint4x32 differently - Metal converted to float first while C used the double's full 64-bit representation. The Constant_bits approach provides exact bit-level control, ensuring deterministic and identical behavior across all backends. Both backends now generate the same sequence starting with [0.594238, 0.755859, 0.239014, 0.758789...], confirming the fix is successful.
45cd1e
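A small OCaml check of why the two conversion paths described above diverged (illustrative only; Constant_bits carries the exact 64-bit pattern so both backends seed Threefry identically):

```ocaml
let () =
  let seed = 0.1 in
  (* Full 64-bit pattern of the double, as the C backend used it. *)
  let double_bits = Int64.bits_of_float seed in
  (* Pattern obtained after first rounding to float32, as Metal effectively did. *)
  let via_single = Int64.bits_of_float (Int32.float_of_bits (Int32.bits_of_float seed)) in
  Printf.printf "double bits: 0x%016Lx\nvia float32: 0x%016Lx\n" double_bits via_single
  (* The two patterns differ for most values, so the derived random streams differ too. *)
```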
Slight pivot: postpone shape handling enhancements till after achieving performance
2031ff
Update expectations
530087
Formatting
9f9350
ppx_minidebug `truncate_children` config
4ca0e4
README: reworked upcoming milestones, prepare for release
39c183
Make local heap allocation opt-in; rename pp_float -> pp_scalar
fb3180
inline_complex_computations: if so, don't count self-defining accesses toward virtualize_max_visits
0a4645
`loop_float` -> `loop_scalar` renaming
5c1b2d
Inline single getters i.e. "views"
3b98f9
Avoid generating dead tensors by delaying computation of RHSes, e.g. for non-existent backpropagations As side effect also fixes potential crashes where such dead tensor cases use ternary op assignments.
e0f5eb
Merge pull request #342 from derekchiang/patch-1 Remove redundant int -> float -> int conversion in `test/training/bigram.ml`
fed7a6
Propagate precision into all terminal tensor expressions by default
13e301
Finally a fix for the build issues with directory targets: (1) Don't delete artifacts in tests as rules often run in parallel and a follow-up rule can need them. (2) By default, don't create dlls in build_files/ as this is more problematic than a unique temporary file path. (3) Rename a confusing f_name to f_path in a few places.
b89432
Move to sync_cc as the primary testing target while #341 is not resolved
34d3de
A good seed for moons_demo
cb9683
New test moons_demo_mini; updates test expectations; unfortunately, bigram is nondeterministic; and moons_demo is broken on multicore_cc but probably an unlucky random seed because moons_demo_parallel works well
0abd20
Fourth pass on bidirectional precision inference: don't force precisions from below from defaults when inferring
0a45e7
Third pass on bidirectional precision inference: include top-down tensors with precision specified by the user in bottom-up propagation There'll be a fourth pass, to not force precisions from below from defaults but rather start with unspecified.
2d00d5
Second pass on bidirectional precision inference; cleanup after Claude, pivot to storing the top_down_prec flag with tensors, make uint4x32_to_prec_uniform always ~top_down_prec:true
073a8f
Top-down precision propagation, coding by Claude Opus Prompt: > Thinking about it some more, this will be less invasive than I expected. We can accomplish top-down precision propagation with very few targeted changes, all inside tensor.ml ! Add a function is_param that checks whether a tensor is a member of its own params field, and a function is_top_down_prec that currently is just a synonym of is_param . Inside the `op` function (starts at line 218 in tensor.ml), around the computation of default_prec , split the subtensors into those that are is_top_down_prec and those that aren't (I notice the code currently uses orig_ts for the subtensors, but the deduplicated ordered_ts is better). Aha, the function `op` needs a new parameter `top_down_prec` for whether the resulting tensor should be computing the precision top-down or not. I suggested defining is_top_down_prec to avoid modifying the Tensor.t type, as alternatively we could add a new field but the prominence of Tensor.t makes it worth keeping it small. When `top_down_prec` is true, we do not use promote_prec but leave default_prec at !default_value_prec and !default_grad_prec respectively. When `top_down_prec` is false, only use the `not is_top_down_prec` subtensors for promotion calculation. The next part is tricky. After the creation of the tensor node (value and grad respectively), perform a delayed precision update in the top-down direction for is_top_down_prec subtensors (their value resp. grad nodes). In the delayed computations, use the actual prec fields (value resp. grad) of the created tensor! This way all the precision information, both default and user-provided, will propagate across the full tensor expression trees. We need a new function in tnode.ml for this, because Tnode.update_prec takes a non-delayed prec , and adds it with the higher user-enforced priority by using the Specified field. We can call this new tnode.ml function update_infer_prec . Should fail if the prec is already forced (Lazy.is_val tn.prec), and should not do anything if tn.delayed_prec_unsafe is already Specified . Otherwise, if the delayed_prec_unsafe is Not_specified , it becomes the passed-in delayed prec as Default_prec . If there is an earlier Default_prec , we replace it with a Default_prec that's a delayed promotion calculation of the earlier and the new delayed precs. Altogether, you can verify the changes by running `dune exec test/training/bigram.exe -- --ocannl_output_prec_in_ll_files=true --ocannl_output_debug_files_in_build_directory=true` . Thank you! Prompt: > I'll help. The user-facing parameter creation wrapper is at line 536 -- the param function (apologies for the debugging syntax extension noise). But this is tricky because it doesn't itself create the tensor, it takes a callback to create a tensor. It's an obstacle I hadn't considered. I see only one way out: modify type op_fun to take the optional parameter top_down_prec . That way we can pass `~top_down_prec:true` to the callback regardless of how it was arrived at. It will also force us to take and pass along top_down_prec in the wrappers term, unop, binop, ternop. Looking at tensor.mli , that's not how `val param` works at the moment, let's remove `val param_init` and let's change the type of the callback to `t:op_fun` . Then `param` will itself set the other arguments: `~top_down_prec:true ~batch_dims:[] ?batch_axes:None` . These changes might be easier for me to make than for you, let me know if progress gets hard.
Claude's summary: Successfully implemented top-down precision propagation for parameters in OCANNL: Key Changes: 1. Added helper functions (is_param, is_top_down_prec) to identify parameters 2. Modified op function to accept top_down_prec parameter and handle precision inference differently: - Top-down tensors don't promote precision from inputs - Bottom-up tensors only promote from non-top-down subtensors 3. Added update_infer_prec function in tnode.ml for delayed precision updates 4. Implemented delayed precision updates that propagate precision from created tensors back to parameter subtensors 5. Updated type signatures to include top_down_prec in op_fun type 6. Modified wrapper functions (term, unop, binop, ternop) to pass through top_down_prec 7. Updated param function to set top_down_prec:true ensuring parameters use top-down inference The system now supports bidirectional precision inference: - Bottom-up: Output tensors inherit/promote precision from inputs (existing behavior) - Top-down: Parameters match the precision of contexts where they're used (new feature)
d7bd41
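A hedged, self-contained sketch of the update_infer_prec behaviour specified in the prompt above; the precision type and promote_prec are simplified stand-ins for the project's, not the actual tnode.ml code:

```ocaml
type prec = Half | Single | Double
let promote_prec a b = if compare a b >= 0 then a else b    (* assumption: simple ordering *)

type delayed_prec = Not_specified | Default_prec of prec Lazy.t | Specified of prec
type tn = { prec : prec Lazy.t; mutable delayed_prec_unsafe : delayed_prec }

(* Fails if the precision is already forced; never overrides a user-Specified
   precision; otherwise records the new default, promoting against any earlier one. *)
let update_infer_prec tn (p : prec Lazy.t) =
  if Lazy.is_val tn.prec then invalid_arg "update_infer_prec: precision already forced"
  else
    match tn.delayed_prec_unsafe with
    | Specified _ -> ()
    | Not_specified -> tn.delayed_prec_unsafe <- Default_prec p
    | Default_prec earlier ->
        tn.delayed_prec_unsafe <-
          Default_prec (lazy (promote_prec (Lazy.force earlier) (Lazy.force p)))
```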
Fixes to types in vector assignment op Uint4x32_to_prec_uniform
b10435
Optional precision specs in .ll files, by Claude Code Prompt: Add a global configuration, in the sense of Utils.get_global_arg , output_prec_in_ll_files (e.g. for configuring via a flag --ocannl_output_prec_in_ll_files=true or via an ocannl_config file). When true, it should trigger a new feature: printing of Low_level.t values in human-readable syntax should include the precisions of tensor nodes. Summary by Claude: Perfect! The implementation is now complete with: 1. ✅ The output_prec_in_ll_files configuration option dynamically retrieved via Utils.get_global_flag 2. ✅ Modified to_doc and to_doc_cstyle functions to conditionally include precision annotations 3. ✅ Dune test rule that verifies precision annotations appear when the flag is set 4. ✅ Documentation added to ocannl_config.example The feature works as intended - when output_prec_in_ll_files=true, the Low_level.t output in .ll files includes precision annotations like <single>, <half>, etc., making it easier to debug precision-related issues in the generated code.
763783
Run bigram dice rolls on the backend Hopefully once everything works this will make the test deterministic and consistent across machines
c7a64e
Shape inference: more aggressive solving of Total_elems rows constraints that takes Least Upper Bounds into account
a2471c
`%cd` syntax: automatically generate comments; more lightweight comments in Train.grad_update
45cf3c
Don't include zero_grads in consume_backprop_code, to avoid forcing handling zero_grads and backprop together
80b7d0
`disable_rootness_check` is going awaaaay
24f71f
Incorporate let-binding identifier in `%cd` names; fix handling of syntactic functions
9cd126
Bug fix `%cd` syntax: escaping tensors for getting labels
1ef3c9
Remove `uniform1` for now, will be in 0.6.1
e6f0e7
Bigrams example test, in progress: flakiness Co-authored-by: Derek Chiang <me@derekchiang.com>
9a2006
Fixes `bus error` on large datasets #327, by Claude Opus Prompt: Can you help me debug a 'bus error' that happens when running `dune exec bin/bigram.exe` ? It happens when batch_size (file bin/bigram.ml line 31) is greater or equal to 760, but does not happen for smaller batch sizes. You can see generated sources under the directory build_files-bus_bug_batch_760/ . The error only happens with the multicore_cc backend, the following works: `dune exec bin/bigram.exe -- --ocannl_backend=sync_cc` . So, OCaml's Domains must be somehow involved. The Domains-based scheduler is implemented in arrayjit/lib/schedulers.ml . Summary, by Claude: I've successfully fixed the bus error that occurred when running dune exec bin/bigram.exe with batch_size >= 760. Root Cause: The issue was stack overflow in OCaml Domain threads. The generated C code was allocating large arrays on the stack. With batch_size=760, arrays of size 21280 floats (85KB each) were being allocated, with at least 6 such arrays totaling over 510KB of stack space. Domain threads have limited stack size, causing a bus error. Solution: Modified arrayjit/lib/c_syntax.ml to: 1. Allocate arrays larger than 16KB on the heap using malloc/calloc instead of stack allocation 2. Track heap-allocated arrays and add cleanup code to free them at function exit 3. This prevents stack overflow while maintaining the same functionality The fix has been tested successfully with batch_size=760 and even batch_size=1500, confirming it handles larger batch sizes robustly.
947d76
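The arithmetic behind the fix above, with the allocation policy as a sketch (the 16 KB threshold is the one stated in the commit; the real change emits malloc/calloc and matching free calls in the generated C rather than returning a tag):

```ocaml
(* With batch_size = 760, one local array holds 21280 single-precision floats,
   i.e. 21280 * 4 = 85120 bytes (~85 KB); six of them exceed 500 KB, more than a
   Domain thread's stack allows, hence the bus error. *)
let stack_alloc_limit = 16 * 1024

let allocation_strategy ~num_elems ~elem_bytes =
  if num_elems * elem_bytes > stack_alloc_limit then `Heap else `Stack

let () = assert (allocation_strategy ~num_elems:21280 ~elem_bytes:4 = `Heap)
```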
A "names" dataset with a bigram use-case helper, sites setup by Claude Sonnet Co-authored-by: Derek Chiang <me@derekchiang.com>
d61512
Arrrgh... bug fix to Utils.sorted_diff
f4a9bf
Refactoring of Assignments.t to avoid field duplication and to provide projections debug without forcing projections, by Claude Code Prompt 1: I just started refactoring Assignments.t in arrayjit/lib/assignments.ml to avoid duplicating lots of fields, and I'm adding the field projections_debug that I want to contain the `debug_info` field of `Indexing.projections`, or its approximation, even before the projections are actually computed. This will require coordination with the syntax extension ppx_cd.ml . I attach the relevant files in the context. Update the code to take the changed type definitions into account, and set the `projections_debug` fields from the projections specs. Point out where finding `projections_debug` without calculating the projections is tricky, I can help. Summary 1, by Claude: ## Summary I've successfully updated the code to use your refactored `Assignments.t` type. Here's what was done: 1. **Updated `assignments.ml`** to use the new consolidated `Accum_op` constructor that contains an `accum_rhs` field (which can be `Unop`, `Binop`, or `Ternop`) instead of the separate `Accum_unop`, `Accum_binop`, and `Accum_ternop` constructors. 2. **Added the `projections_debug` field** to both `Accum_op` and `Set_vec_unop` constructors to store the projection specification string. 3. **Updated `tensor.ml`** to set the `projections_debug` field by using `Shape.logic_to_spec shape_logic` to convert the shape logic to its string representation. 4. **Updated `ppx_cd.ml`** to generate code using the new constructor. For cases where projections are computed lazily (no explicit `~logic:` parameter), I set `projections_debug` to `"<derived>"` as a placeholder. 5. **Exported `Shape.logic_to_spec`** in the `shape.mli` interface file so it can be used from `tensor.ml`. ### Tricky parts where finding `projections_debug` without calculating projections is challenging: 1. **In `ppx_cd.ml` when no explicit `~logic:` is provided**: In these cases, the projections are computed lazily based on shape inference. We don't have the actual projection spec string at assignment creation time, so I used `"<derived>"` as a placeholder. If you need the actual spec in these cases, you might need to either: - Extract it from the lazy projections when they're forced - Compute a preliminary spec based on the operator type (most operators default to "." for pointwise) - Store additional information in the assignment to compute the spec later 2. **For vector operations**: The `Set_vec_unop` operations also use projections, and similarly we set `projections_debug` to `"<derived>"` when created through ppx_cd without explicit logic. The code now compiles successfully with `dune build @check`. Prompt 2: How difficult would it be to change `type projections` in the Tensor module from `Ir.Indexing.Projections` to `string * Ir.Indexing.Projections Lazy.t` and then skip `Lazy.t` at use sites, so that we can pass the debug together with the delayed projections? The debug would be computed in the `op` function that starts at line 206 of tensor.ml , using `Shape.logic_to_spec` on the `logic` field of `List.hd_exn local_shape_updates`. Prompt 3: Let's clarify what we want so that we're on the same page. We want benefit (3). We don't want (1) and it's not really true: the passing of the pair is so that the content of `projections_debug` is readily available, in fact we can define: `type projections = {projections_debug : string; projections : Ir.Indexing.projections Lazy.t}` in the Tensor module. 
The question I hadn't considered is which type should the `op_asn` and `grad_asn` functions be using. It should be `projections` from Tensor rather than `Indexing.projections Lazy.t` because that's the only way that `ppx_cd.ml` can unpack the two fields to populate inside Accum_op and Set_vec_unop . Let's do this, so that the debug information becomes precise. Summary 2, by Claude: ## Summary I've successfully implemented the change to use a `projections` record type in the Tensor module that contains both the debug string and the lazy projections. Here's what was done: 1. **Defined a new projections type** in `tensor.ml` and `tensor.mli`: ```ocaml type projections = { projections_debug : string; projections : Ir.Indexing.projections Lazy.t; } ``` 2. **Updated projection creation sites** in `tensor.ml` to create the record with the debug string computed using `Shape.logic_to_spec`: - In `raw_binop`, `raw_ternop`, `raw_unop`, and `op` functions - The debug string is computed at projection creation time from the shape logic 3. **Updated op_asn and grad_asn function signatures** to use `Tensor.projections` instead of `Ir.Indexing.projections Lazy.t` 4. **Updated ppx_cd.ml** to handle the new structure: - When projections are passed explicitly (`~projections`), extract fields as `projections.Tensor.projections` and `projections.Tensor.projections_debug` - When creating lazy projections, reference `projections.Tensor.projections` in the lazy computation - Use the extracted fields when creating assignments 5. **Fixed usage sites** that needed to access the projections field The key benefit is that now the projection debug information is **always precise** - it's computed from the actual shape logic at the point where projections are created, rather than using placeholders like `"<derived>"`. This debug information travels together with the projections throughout the system, making it available wherever projections are used. ```ocaml type projections = { projections_debug : string; projections : Ir.Indexing.projections Lazy.t; } ```
c63a59
Fixes #334 and Optimize shape inference #215: No more traversing of the whole env with eliminate_variables
324a70
Fix so it works with ppx_minidebug disabled at compile time
d164ae
`%op`: Don't pun at local let bindings, keep collecting
a2d201
Fix the slot selection heuristic for deriving projections in the `%cd` syntax
ca0e9f
Fix automatic memory mode for tensors changed on host only
884d5c
Fix enabling of zero-dimension scalars in metal and cuda backends
2eb8cc
Fixes #330 by failing informatively on a computations table miss; be more careful about threading optimize_ctx; debuggability tweaks
21e024
Major nasty bug: accidental optimize_ctx sharing
7336e3
First pass on a slew of bugs uncovered by the Metal backend Very strange to be getting session-level bugs in one of the backends but not the other.
fb4b36
Embed non-linked builtins in OCaml strings for easier availability at runtime Note: Metal backend still broken
591941
Fix builtins and random numbers in the metal backend, by Claude Sonnet Summary 1, by Claude: I successfully updated the builtins.metal file to: Removed the uint4x32_t wrapper struct - Now using Metal's native uint4 type directly Updated the arrayjit_threefry4x32 function - Changed signature from uint4x32_t parameters to uint4 Fixed naming inconsistencies - Renamed functions to match the expected patterns: uint4x32_to_i64_uniform → uint4x32_to_int64_uniform uint4x32_to_u32_uniform → uint4x32_to_uint32_uniform uint4x32_to_u64_uniform → uint4x32_to_uint64_uniform uint4x32_to_i8_uniform → uint4x32_to_byte_uniform uint4x32_to_u16_uniform_vec → uint4x32_to_uint16_uniform_vec uint4x32_to_i8_uniform_vec → uint4x32_to_byte_uniform_vec uint4x32_to_i64_uniform_vec → uint4x32_to_int64_uniform_vec Added missing functions that exist in the C version: uint4x32_to_uint16_uniform (scalar version) uint4x32_to_fp8_uniform (scalar version) uint4x32_to_uint32_uniform_vec (vectorized version) uint4x32_to_uint64_uniform_vec (vectorized version) Updated all function signatures to use uint4 directly instead of the wrapper struct The changes improve consistency between backends and remove unnecessary indirection. The test failure is due to the neural network using double precision, which Metal doesn't support. To resolve this, you could: Change the test to use single precision (--ocannl_default_prec=single) Or test with a different backend that supports double precision Or configure the model to use single precision from the start The builtins.metal file is now properly updated and consistent with the naming conventions in ops.ml. Summary 2, by Claude: The Metal backend was using simple type casts (uint4)(value) instead of proper bit-preserving conversion functions for the Threefry4x32 random number generator. This meant: (uint4)(23.0) became uint4(23, 0, 0, 0) (just the integer part), when it should have been the full 32-bit representation of the float. 🛠️ Fix Applied: Added missing conversion functions to builtins.metal, and updated the Metal backend's convert_precision function to use proper conversions. ✅ Result Before: Metal backend had completely different random sequences → poor training After: Metal backend matches C backend results → proper training convergence The fix ensures that random number generators in both backends receive identical seed bit patterns, producing consistent training behavior across all supported backends.
f65c70
Missing from a previous commit: `TDSL.uniform`
8e1783
Fix nullary operation `uniform` and the default initialization setup to generate properly fresh tensor expressions
03a0d8
session state Tensor id validation in `op` Added checks for tensor IDs in `op` to prevent invalid state usage. Updated `unsafe_reinitialize` documentation to clarify its purpose in preventing session state pollution.
88f6de
Update expectations to recent changes
83f1b1
Use a fixed bias 0.5 in the half-moons examples: (1) current randomness doesn't work with sizes not divisible by 2 / 4 / 8 / 16 (double / single / half / fp8); (2) learnable bias is redundant as the other weights can adapt; (3) 0 bias does not train well with ReLU activations
507753
Fix insufficient propagation when Total_elems has both a row and a dim variable
a15aea
Major refactor of the Tensor API to share parameter signatures and reduce boilerplate in configurable operation definitions; some cleanup
c8d014
Updated test for embed self id
43cc8d
embed_self_id operation, by Claude Opus Summary by Claude: I successfully implemented support for the Embed_self_id fetch operation: 1. In arrayjit/lib/assignments.ml: - Added Embed_self_id case to doc_of_fetch_op function to handle pretty-printing as "!@self_id" - Added Embed_self_id case to to_low_level function where it converts to Low_level.Constant with the Tnode id converted to float 2. In lib/operation.ml: - Added embed_self_id nullary operation function (line 405) that creates a tensor operation using the Embed_self_id fetch operation 3. In lib/tensor.ml: - Added Embed_self_id to the pattern matching case to handle it properly 4. In lib/shape.ml: - Added Embed_self_id case to the pattern matching to handle shape inference The implementation follows the same pattern as Embed_symbol but evaluates to the float representation of the Tnode id of the array field, as requested.
a0eb1b
Updated expectations; formatting
e12180
Don't interpret Threefry4x32 on floats; formatting
42060a
Reapply rows constraints as they get updated in the environment; debuggability
38b4a3
Defensiveness: always fallback on Fixed_idx 0 for projections for scalars
e5544d
Be defensive: fall-through unsolved row constraints
bcaeb3
Fourth and final pass on adding vector-returning operations, by Claude Opus Prompt: Let's complete the support for vector assignments -- the Set_from_vec constructor -- by filling out code around the FIXME: NOT IMPLEMENTED YET comments -- file c_syntax.ml line 604 and line 614, and low_level.ml line 275. For more context, read the message of the commit where we introduced this constructor: 85eaff9 - First pass on adding vector-returning operations. To address low_level.ml, consider two approaches: (1) in type traced_array , migrating from index tuple based indexing (int array), to offset based indexing (single int), but ignoring padding, for the fields `assignments` and `accesses` ; then in the for loop around line 275, we simply update assignments at `i + lookup env idcs` . (2) Without changing type traced_array , we need to start with the result of `lookup env idcs` and increment it in the loop at the rightmost axis, with carry over to earlier axes on exceeding the dimension size. For both (1) and (2) we need `Tn.dims_without_padding tn` to get the dimensions. Addressing c_syntax.ml line 604 is simpler because it already corresponds to (1) above -- addressing is offset based. We can either modify pp_array_offset, or textually append `" + " ^ Int.to_string i` to its result. The code for logging, missing around line 614 in c_syntax.ml , is more complicated so we should refactor the corresponding handling for the variant Set and reuse code appropriately. To verify that everything still compiles run `dune build @check` , and to see the effect of changes run `dune exec test/operations/threefry4x32_demo.exe` . Summary by Claude: Summary I successfully completed the support for vector assignments (Set_from_vec constructor) by implementing the missing functionality: 1. c_syntax.ml line 604: Implemented general index handling for non-Fixed_idx cases by computing the offset correctly with + i when the last index is not a Fixed_idx. 2. c_syntax.ml line 614: Implemented debug logging for Set_from_vec by generating log statements for each element of the vector assignment, following the pattern used in the Set variant. 3. low_level.ml line 275: Implemented general index handling by computing flat positions and converting back to multi-dimensional indices when encountering non-Fixed_idx in the last dimension. 4. assignments.ml line 279: Fixed the critical bug in the Affine index substitution within the Set_vec_unop lowering code. The FIXME comment indicated that symbols in Affine indices needed to be substituted, which was causing undefined iterator variables in the generated C code. The threefry4x32_demo now runs successfully, generating 44 uniform random numbers from 11 input values (4 outputs per input).
f87d3b
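Approach (2) from the prompt above, sketched as plain OCaml (names illustrative, not the project's): advance the multi-dimensional index by one cell, incrementing the rightmost axis and carrying into earlier axes whenever a dimension from Tn.dims_without_padding is exceeded.

```ocaml
(* Returns a fresh index array one position past [idcs] in row-major order. *)
let next_idcs dims idcs =
  let idcs = Array.copy idcs in
  let rec carry axis =
    if axis >= 0 then begin
      idcs.(axis) <- idcs.(axis) + 1;
      if idcs.(axis) >= dims.(axis) then begin
        idcs.(axis) <- 0;
        carry (axis - 1)
      end
    end
  in
  carry (Array.length idcs - 1);
  idcs

let () = assert (next_idcs [| 2; 3 |] [| 0; 2 |] = [| 1; 0 |])
```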
Tightly control what enters into the product space for Total_elems with Strided_var We will probably need the same for convolution / strided iteration...
d37645
In progress: support projections for constraints derived for Uint4x32_to_prec
87f78c
Substitute inside rows constraints, otherwise they never get resolved
36e8cb
In Total_elems with stride row constraint, also solve for the row side by substituting the stride variable
528652
In progress: introduce Sub_axis to the type index In many ways it behaves like Fixed_idx 0, but it indicates the axis is a sub-axis of an adjacent index.
763356
Third pass on adding vector-returning operations, by Claude Opus: missing vector types and backend operation implementations Summary by Claude: Summary I've successfully completed the task of converting the uint4x32_to_*_uniform functions to return vectors and implementing code generation for Set_from_vec across all backends: 1. Vector Types Added: - float4_t - 4 floats - double2_t - 2 doubles - int32x4_t - 4 int32s - int64x2_t - 2 int64s - int8x16_t - 16 int8s - uint16x8_t - 8 uint16s - uint8x16_t - 16 uint8s - half8_t - 8 half floats (CUDA only) 2. Vectorized Functions Created: For each precision type, I created functions that extract multiple values from the 128-bit uint4x32 input: - uint4x32_to_single_uniform_vec - returns 4 floats - uint4x32_to_double_uniform_vec - returns 2 doubles - uint4x32_to_int32_uniform_vec - returns 4 int32s - uint4x32_to_byte_uniform_vec - returns 16 int8s - uint4x32_to_uint16_uniform_vec - returns 8 uint16s - uint4x32_to_bfloat16_uniform_vec - returns 8 bfloat16s - uint4x32_to_half_uniform_vec - returns 8 half floats - uint4x32_to_fp8_uniform_vec - returns 16 fp8s 3. Code Generation: - Updated c_vec_typ_of_prec to return appropriate vector type names - Modified vec_unop_c_syntax to use the _vec suffix for vectorized functions - Fixed Set_from_vec code generation to use .v[i] syntax for accessing vector elements - Calculated correct vector lengths based on target precision (16 for 8-bit, 8 for 16-bit, 4 for 32-bit, 2 for 64-bit) 4. Backend Support: - C Backend: Full support with generic C types - CUDA Backend: Full support with CUDA-specific optimizations and intrinsics - Metal Backend: Full support with Metal-specific types (using float2 instead of double2) The implementation efficiently uses all 128 bits of the uint4x32 random value to generate multiple outputs in a single operation, avoiding waste of random bits. Follow-up about half precision: Summary I've successfully updated the implementation to use proper half precision vector types across all backends: 1. Added half8_t type definition to the C backend using _Float16: typedef struct { _Float16 v[8]; } half8_t; 2. Updated vec_typ_of_prec to return "half8_t" for half precision with length 8 across all backends, separating it from the uint16/bfloat16 case. 3. Updated the vectorized half conversion functions to return half8_t instead of uint16x8_t: - C backend: Uses _Float16 type with simple float-to-half conversion - CUDA backend: Uses __half type with __float2half intrinsic - Metal backend: Uses native half type This makes the implementation clearer and more type-safe by: - Distinguishing between actual half-precision floating-point values and uint16 representations - Allowing backends to leverage native half precision support and optimizations - Making the code more maintainable by using the appropriate type for each precision The bfloat16 values continue to use uint16x8_t since bfloat16 often lacks native hardware support and is manipulated as uint16 values with bit operations.
e89f04
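The lane counts listed above follow directly from dividing the 128 random bits of a uint4x32 by the target precision's bit width; a quick check of that arithmetic:

```ocaml
let lanes_of_bits target_bits = 128 / target_bits

let () =
  assert (lanes_of_bits 8 = 16);   (* fp8, byte, uint8 *)
  assert (lanes_of_bits 16 = 8);   (* half, bfloat16, uint16 *)
  assert (lanes_of_bits 32 = 4);   (* single, int32 *)
  assert (lanes_of_bits 64 = 2)    (* double, int64 *)
```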
Second pass on adding vector-returning operations: cleanup and locating unfinished places
c8d36d
First pass on adding vector-returning operations, by Claude Opus Prompt: Separate Uint4x32_to_prec_uniform from the type unop in arrayjit/lib/ops.ml into a separate type vec_unop . In the Low_level module type t , add a constructor Set_from_vec { tn : Tnode.t; idcs : Indexing.axis_index array; length: int; vec_unop : Ops.vec_unop; arg : scalar_t; mutable debug : string } . This constructor is an analogue of the Set constructor but for vector rather than scalar values. The length field is the length of the vector, and backends should set the consecutive cells of the buffer corresponding to tn , starting with position idcs . At present, given that we only have one vector operation, for simplicity we do not introduce a separate type vector_t , that's why the fields of a hypothetical vector_t Unop are inlined into Set_from_vec . In the Assignments module type t , add a constructor Set_vec_unop { op : Ops.vec_unop ; lhs : Tn.t ; rhs : buffer; projections : Indexing.projections Lazy.t } . Add handling for the new variants following the intended meaning. Add support for the separated out vector assignment and vector operation in the syntax extension file ppx_cd.ml . Reach out for clarifications, I'm here to help. Once we get the backend-generic code into a good shape, we will add support in the backend-specific code. We will start by converting the uint4x32_to_*_uniform functions in builtins.c , builtins_small.cu and builtins.msl to return vectors of results, for each function we will introduce a vector type holding the specific number of results that these functions should compute to make efficient use of the random bits from the argument uint4x32 value. Then, we will enhance code generation to support the vectorized assignment Set_from_vec by unrolling it into the `length` number of individual assignments. Summary by Claude: I successfully separated Uint4x32_to_prec_uniform from the scalar unary operations (unop) into a new vector unary operations type (vec_unop) in the OCANNL codebase. Here's what was done: 1. Created new types and constructors: - Added vec_unop type in ops.ml with Uint4x32_to_prec_uniform as its only variant - Added Set_from_vec constructor to Low_level.t type for low-level vector assignments - Added Set_vec_unop constructor to Assignments.t type for high-level vector assignments 2. Updated pattern matching across the codebase: - Added handling for Set_from_vec in ~12 functions in low_level.ml - Added handling for Set_vec_unop in 6 functions in assignments.ml - Updated c_syntax.ml to generate proper C code for vector assignments - Fixed metal_backend.ml to handle the new vec_unop type - Updated operation.ml to use the vector assignment syntax (:= instead of =:) 3. Added PPX syntax extension support: - Separated vec_unary_ops from unary_ops in ppx_shared.ml - Added process_vec_unop function in ppx_cd.ml - Added pattern matching cases for vector unary operations using := assignment 4. Key implementation details: - Vector operations set multiple consecutive elements starting from a given index - The length field in Set_from_vec specifies how many values are produced - The last index in idcs must be a Fixed_idx and is incremented for each output element - C code generation creates a vector result variable and assigns each element individually The code now compiles successfully. 
The runtime error about shape mismatch indicates that further work is needed in the shape inference system to properly handle operations that produce multiple outputs, but that's beyond the scope of the current refactoring task.
85eaff
Rename llv -> llsc
4dcf8d
Rename float_t -> scalar_t
20809e
Perform eliminate_rows_constraint even on standalone constraints, work around looping on no-progress constraints with depth tracking
9a988f
Untested: fix builtins modules across devices Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
b54b34
Better names for builtin files, for CUDA handle large builtins via function pointers Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
655d5b
Fix CUDA backend staleness regressions, obsoleted precision naming Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
13f3be
Get rid of the GCC gccjit backend, it got stale and doesn't provide value Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
91cb91
Merge pull request #331 from ahrefs:threefry4x32 Threefry4x32 and Uint4x32-to-uniform operator support Tests are still expected to fail unfortunately.
789d7b
Update upcoming milestones / releases
8335a2
Test idempotency missing from the previous commit
a46fb8
Consistency for print_tree, and refactor of where forcing happens in "quick exec" helpers and printing Refactor tensor operations to use `Train.forward_once` and `Train.printf` for improved clarity and consistency. Removed unnecessary context initialization in multiple files, streamlining the codebase.
84a2bc
Fix double initialization and potential missing initialization from host
e18061
force_read -> force renaming, formatting
06ce7c
Fixes #326: Fix the wrongly implied assumption that in einsum spec, axes of omitted kind get broadcasted or reduced
276611
Fix Tensor.print endlines
102b9f
Fixes #324: Make Tensor.print non-forcing by default; refactor forward_and_forget to forward_and_force This change ensures that tensor values are forced to the host as needed, but aren't forced by mistake.
3c0a07
Fix: `Shape.sexp_of_t` was potentially broken! Better safe than sorry.
5caec3
Train.init_params that takes care of setting up the initialized context
77358d
Suppress the welcome message in tests
4e0727
README milestones and test policy update.
ac1117
Fix my mess-up in previous commit, expose embeddedness
6683f2
Restore printing parameter subtensors in forward code, mostly by Claude Sonnet They are never
46aaff
Fix test expectation idempotency for the previous commit
37394a
Fix and simplify Tensor.ndarray label
e03fdd
Sort Tnode headers before logging them
e02848
Fix: host_read_by_devices -> devices_not_lagging_host with the corresponding change of semantics This fixes overriding of changes by transferring a stale version from host.
7510e7
Change misleading <virtual> to <not-hosted>
38eebf
Remove runtime changing of debug settings from tests and examples (except test/operations/micrograd_demo_logging.ml) including commented-out They lead to confusion with user configurations.
1289c6
Control listings in print_accessible_headers
75b5fb
Full support for padding in ndarray.ml, by Claude Sonnet
6634fe
Postpone setting dim variables to their lower bounds till the very end.
51bc96
Missing from the previous commit
89301c
row.ml: fix missing divisibility guards, improve At_least_dim for Conv_input
f529ae
Multiple fixes to shape inference around closing terminal rows and Total_elems inference 1. Be careful about stages when eliminating row constraints. 2. Move elimination of non-LUB rows from stage 2 to stage 4 (and partially 3) since new LUBs can arrive by other inference. 3. Be more careful when eliminating Total_elems to not prevent incorporation of LUB values via adding no-further-axes, but also to not impose single-axis-dim-1 accidentally (since no-further-axes is also Total_elems 1).
ab15bd
Missing param initializations: zero2hero_1of7 test
e22b6f
Remove printing range_of_shape tensors in tests (they are virtual now by default).
eea272
Update the moons_demo test and signal shortcoming in consume_forward_code
29a8d5
Refactoring and a small cleanup of test organization
6f9d38
Also rename Ops.map_prec -> apply_prec (see recent commit on renaming Ndarray.map)
90bf5f
Better naming Ndarray.apply
166b2f
Fix missing reshape in Tnode.create_with_reshape
dbff01
Proper implementation of arrayjit_copy_with_padding, by Grok
61cf31
Fill-in eliminate_rows_constraint multi-row-var coverage
72b981
Fill-in missing cases in apply_rows_constraint
19637e
Improving Total_elems shape inference: safe wrt. forcing; cover missing cases; address remaining known FIXMEs
93d161
Flesh out row_conjunction case Total_elems vs Total_elems
ba1de0
Support for strided iteration outside of convolutions
b520bd
Good progress on shape inference for Total_elems with a scaled var numerator, by Claude Opus
d38c4e
Formatting; better inference for Exact rows constraints
7e7284
Shape inference doc small update: monomorphism, new type defs
97eb77
Migration and commit message by Claude Sonnet Complete elimination of dedicated_access type and migration to proper operation abstractions This commit completes the big refactoring to eliminate the dedicated_access type and migrate to cleaner, more type-safe abstractions while maintaining interface compatibility. ## Major Changes ### Eliminated dedicated_access Type - Removed `dedicated_access` type entirely from `arrayjit/lib/low_level.ml` and `.mli` - Migrated `Merge_buffer` access to new `Get_merge_buffer` variant in `float_t` type - Eliminated GPU-incompatible `C_function` and `External_unsafe` variants completely - Updated all pattern matches across low_level.ml, c_syntax.ml, and shape.ml ### Enhanced Operation System - Moved `Uint4x32_to_prec_uniform` from fetch_op to proper unary operation in `ops.ml` - Added corresponding `Uint4x32_to_prec` transpose type in shape system - Added placeholder implementations in CUDA and Metal backends - Proper shape inference support for precision conversion operations ### Improved Type Safety and Consistency - Added `terminal_type` for unified shape inference from init_data and fetch_op - Updated `tensor.mli` to use `terminal_op` parameter in `op` function signature - Maintained backward compatibility in `term` function interface (init_data/fetch_op) - Consistent handling of merge buffer operations across all backends ### Code Quality Improvements - Added comprehensive FIXME comments for unimplemented functionality - Updated documentation to reflect new type organization - Eliminated all compiler warnings about missing pattern cases - Maintained 1-to-1 correspondence with old functionality where intended ## Files Modified - `arrayjit/lib/low_level.ml` and `.mli` - Removed dedicated_access, added Get_merge_buffer - `arrayjit/lib/ops.ml` - Added Uint4x32_to_prec_uniform unary operation - `arrayjit/lib/assignments.ml` - Removed C_function and External_unsafe variants - `arrayjit/lib/c_syntax.ml` - Updated pattern matches and added Get_merge_buffer support - `arrayjit/lib/cuda_backend.ml` - Added Uint4x32_to_prec_uniform placeholder - `arrayjit/lib/metal_backend.ml` - Added Uint4x32_to_prec_uniform placeholder - `lib/shape.ml` and `.mli` - Added terminal_type and Uint4x32_to_prec support - `lib/tensor.ml` and `.mli` - Updated operation signatures with terminal_op support - `lib/operation.ml` - Migrated all calls to use new interfaces ## Testing Status ✅ All compilation errors resolved ✅ All pattern match warnings eliminated ✅ Backward compatibility maintained for key interfaces ⚠️ Uint4x32_to_prec_uniform implementation pending (placeholders in place) ⚠️ Get_merge_buffer full integration pending (basic structure complete) This refactoring significantly improves the type safety and organization of the codebase while eliminating GPU-incompatible abstractions that broke backend encapsulation.
767421
Big refactoring: Uint4x32_to_prec_uniform moves from a fetch op to a proper unary op (Ops) with dedicated shape and projections inference support (once done); getting rid of the remaining dedicated_access fetch ops with a migration of Merge_buffer to a stand-alone Get_merge_buffer variant in float_t (Low_level); and better consistency with the new terminal_type (Shape). - Introduced a new `uint4x32_t` structure and a stub for the `arrayjit_threefry4x32` function. - Updated `float_t` type to include `Get_merge_buffer` and removed the `dedicated_access` type. - Added `Uint4x32_to_prec_uniform` operation in `ops.ml` and updated related type definitions in `shape.ml` and `shape.mli`. - Modified tensor operation signatures to include a new `terminal_op` parameter for better expressivity.
ad9a53
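A minimal sketch of the shape of the refactoring described in the two entries above, with entirely hypothetical constructor payloads (the real `float_t`, unary-op, and node types in `low_level.ml` and `ops.ml` differ): merge-buffer reads become a constructor of the scalar-expression type instead of going through a separate `dedicated_access` type, and `Uint4x32_to_prec_uniform` becomes an ordinary unary operation.

```ocaml
(* Hypothetical sketch only: constructor payloads and any names not mentioned
   in the commit messages above are assumptions, not OCANNL's definitions. *)
type node_id = int

(* Before: a separate access type that every backend had to special-case,
   including GPU-incompatible variants. *)
type dedicated_access =
  | Merge_buffer of node_id
  | C_function of string
  | External_unsafe

(* After: merge-buffer reads live directly in the scalar-expression type, so
   all backends handle them through the same pattern match. *)
type float_t =
  | Get of node_id
  | Get_merge_buffer of node_id
  | Constant of float

(* After: the precision-converting uniform sampler is a unary op rather than
   a fetch op, so it participates in shape and projections inference. *)
type unop =
  | Identity
  | Uint4x32_to_prec_uniform
```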
In progress: preparations for threefry, get rid of File_mapped
04aebe
Cleanup to the `param` interface
f24a1e
Untested: factor out the synthetic `half_moons` dataset; by Claude Sonnet
a74ec4
Untested: (1) restoring hosted data initialization; (2) arbitrary tensor expression initialized params; (3) fix in backprop for params (1) restores initialization functionality, but from ndarray, mostly no-copy. (2) allows for wrapping e.g. random sampling tensor expression as a param. (3) prevents backprop into initialization code of params. That code doesn't disappear, can be used manually.
94ba83
Untested: done remaining work for multi-row constraints including Exact dims
b01c28
Incremental progress on handling Exact row constraints: mostly row_conjunction
ad2fce
Fix compilation errors and add missing pattern match cases - Fixed tensor.ml to match interface with init_data parameter ordering - Added handling of Exact row constraint with FIXME comments for implementation - Added missing pattern match cases to avoid warnings - Fixed use of init_data in terminal_logic and tensor node creation
ec7bf3
Cleanup: remove unused ~v input to fetch_op
f6776d
Automatically init in Train.forward_and_ctx / forward_and_forget; refactoring for bin/ examples The examples are still often broken and will be audited after another round of refactoring.
c465f7
Fix false positives in update_memory_mode check (should be no-change)
fa884e
Initialization-by-copy helpers
09d8a3
Enhance %cd syntax to allow inline tensor declarations in standalone expressions. Adjust related errors, comments and documentation for consistency.
db42dc
Support Tensor.params field via the Tensor.param function; a couple early missing-init fixes More missing-init fixes after %cd syntax is updated to allow inline bindings for non-assignment expressions.
a75dce
Untested outside of macOS: sign binaries before dynlinking
d2f443
Be lenient about pre-filling params with values but special-case filling with a single value The multiple-values case was intended to add a shape constraint but apparently is leaky.
1fdbc4
Auto-set hosted for Train.to_routine; fix ambiguous mem mode provenances
acb41d
Auto-set hosted for Train.forward_and_ctx / forward_and_forget
309a89
In progress: Allow inlining in more cases WARNING: performance regression This commit introduces a new configuration option, `inline_simple_computations`, to control inlining behavior for computations built from index embeddings and scalar constant expressions.
2285ea
Better error for memory mode prohibiting node reuse
5659e7
No, it's not possible without too much complexity
6da09c
Revert constancy-tracking for Virtual; formatting
2761d9
Propagate optimization context (with the computations table for Low_level optimizer)
65255f
By Claude Sonnet: fix missing Affine expansion during inlining, document low_level.ml optimizations
f7224a
Allow Uint4x32_to_prec_uniform to be virtual or local; tiny formatting & cleanup
982b81
Fix insufficient domains: convert terminal -> iterated in get_proj_equations
da6175
Broken: Constant_fill via unrolling, fix Tensor.params field typing, final round of refactoring / plumbing (mostly by Claude Sonnet) It's broken because most of the new functionality is not implemented yet. Also, in all places, neither parameter optimization nor input optimization is done properly!
2611a1
Refactoring progress: implement code expansion for the new fetch ops Constant_fill and Range_over_offsets
874fa3
Migrate MNIST and CIFAR10 to Genarrays
a0f952
Syntax extension support for stride and dilation expansion inside einsum specs
f7b2c4
Migrate to ppxlib 0.36: ppx_cd and cleanup
e05af9
Fixed conv spec parsing in einsum and cleaned up einsum parsing overall
54e026
A step away from the expect-test framework
d8187f
Discriminate resolved padding based on Tnode.t field deferred status
54e0ea
Yay, a better design for convolution shape and projection inference TODO: don't pass all pre-existing paddings as resolved_padding, some can still be updated (based on whether a tensor node's fields are forced).
f2e075
In progress / broken: Factor out a padding type and track padding with shapes for projections inference purposes
50d15c
Small cleanups -- e.g. give up on a TODO file
d12867
Tiny cleanup
7055a6
Convolutions clarification and a bit of formatting
4a2915
In progress / broken: towards convolution projections inference
65793a
dim.Affine.solved is now solved_dim option; fix padding handling via add_dims; fix proj equations for Affine
2556ab
Refactored `type print_style` (by Claude) Also an overlooked update to test_numerical_types.expected
f18e6e
In progress: representation for tracking padding in shape inference
882616
Refactor proj_id into abstract type; in progress: solving affine projections
7e4704
Factor out dim_solved, bug fix s_dim_one, formatting
12bc3d
Support padding via pre-padded ndarrays (my idea, Claude's code)
24ee10
Update priorities: reorder upcoming milestones
3cd6ea
Cleanup and formatting
54645a
Some more progress on concatenation-along-axes before we give up
07092c
Factor out extract_dims_and_vars, flatten Prod on substitution
4dabf5
Fixes #204 and #319: fix emulation of FP8, BF16 via single prec floats in Pure_C_config numerics
d49bc9
Claude's third pass at adding BF16, FP8: proper conversions in pure C Claude decided to provide full definitions to inline rather than just headers for jitted C sources, for performance.
367ff3
Configurable "default" default precision
1b56fc
arrayjit: Fix test dependency
423af6
Claude's second pass at adding BF16, FP8: conversion functions Now struggling with a build bug causing arrayjit/test to read its parent ocannl_config.
97f756
Untested / broken: Claude's first pass at adding BF16, FP8, uint16, int32
032fac
Metal logging expectation update
95b436
Pre-release cleanup Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
b31eb4
Formatting Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
791f11
Try only pinning printbox-text in the CI for now Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
39f122
Try pinning printbox-text for the latest bug fix Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1daf45
Improved formatting for generated code (by Claude) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
98e7ca
Tensor.print line breaks; test expectations for OCaml-style ndarray formatting Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
04a2e5
Untested: convert remaining uses of Format except where printing Sexp values Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
397418
In progress / broken: Format -> PPrint migration first pass by Claude Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
9afb61
Migrate `Staged_compilation` to `PPrint.document` Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
657f59
Update test target (logging multicore_cc) Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
dbcdf2
Tiny cleanup: no more advance_captured_logs Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
edd537
Fix: CUDA syntax binops were missing outer parentheses Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
32a973
Merge branch 'master' of https://github.com/ahrefs/ocannl
b6f271
When filtering stdout for log messages, forward non-log lines to the original stdout as soon as available. Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
0378b2
stdout capture: close descriptors before joining the domain. Fixed by o3 Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
52f80a
Tentative domain-based stdout capture Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
495b9f
Logging: fixes on the Linux side; cuda logging on Windows is still broken because of the stdout capture issue. Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
330b95
In progress: getting kernel logging to work again on the cuda backend Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
2c2690
New executable read_config, and finally a better logging test setup with per-backend targets; important: test/ocannl_config -> test/config/ocannl_config Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
0c149b
Prepare for release Remaining: CUDA regressions
33efa3
Workaround for Metal logging crashes, with `prefer_backend_uniformity` to make it compatible for testing
1d2701
Metal logging: failsafe for "internal compiler error"
3bbef1
Logging support for the Metal backend, by Gemini
f1ca04
Unified logging for C_syntax, refactored by Gemini CUDA and Metal in-progress / broken: unfinished and potentially buggy.
41a9d1
Logging test harness
9156a2
Bug fix pp_array_offset (AI slop)
3718eb
Gemini's take on Format -> PPrintf
d61cc0
experiment: try a different runner
8c5145
experiment
23c9ad
Don't fail if ocannl_config not found
95dddc
experiment
d36b41
experiment
94d78f
Add static_properties to backend interfaces for device descriptions; try lang version 3.0 Additionally, the `moons_demo_parallel_run.ml` test is updated to print the properties of devices.
aad61e
Make Metal non-optional on macOS and test it on an M1 GPU Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
e752dd
gcc_backend.ml: tiny fixes to recent refactorings Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
e3497a
cuda_backend.ml tiny refactoring fixes Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
86bf62
Formatting
a68de5
Fix overlooked metal pinned dependency
e847d8
Prepare for release
f15e30
Fix Satur01_gate for non-Metal backends
c16efd
Don't forget to pass constant buffers also
5664bf
Get rid of memset for now
82564e
Be more precise with the interfaces for the optional backend implementations
cf2db5
Explain which backend is missing, on error
2b8cbf
Cleanup of the AI-generated Metal backend
bfe0c8
Host memory wrapper needs to know the size
e586db
Metal backend: AI slop round 3, Claude At first glance this is worse than Gemini, because `runner` doesn't own a SharedEvent here.
6d4680
Refactor backends interface to initialization on module creation
a6a7fd
Fix spurious metal dep bound
3a0dac
Temporary AI slop: metal_backend third try Gemini 2.5 from Cursor
88991a
Refactor the arrayjit package into multiple libraries so each optional backend can be an optional library. Also add scaffolding for the metal backend. Refactor backend integration to use the new IR library. Updated dependencies in `arrayjit.opam` and `dune-project` to include `metal`. Adjusted references throughout the codebase to align with the new structure.
5cafbd
Tiny formatting Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
a65dd8
Fix documentation links Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
225523
Tiny metadata fixes before release Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
5e7759
Merge branch 'master' of https://github.com/ahrefs/ocannl
1b832f
Missing from previous commit Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1705d6
Don't fail when native compilation `(modes native)` is not available Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
82583d
Formatting Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
0e2c94
Update to cudajit.0.7.0 Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
330f85
Fix ppx_minidebug version, update cudajit version Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
463c57
Clarify non-accumulating FMA. Prepare for release 0.5.2 Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
2e38b8
Update changelog in preparation for 0.5.2 Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
9f6817
Update for the final state of ppx_minidebug 2.2.0 Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
dbfd76
Update to support ppx_minidebug 2.2.0: local runtimes only Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
e0b7c6
More %cd flexibility: derive projections for `!.` and `!..`
203240
%cd syntax: Provide projections for scalar constants directly TODO: also special handle operators `!.` and `!..` to use the Scalar slot.
5bd310
Differentiable conditional -> piecewise-defined functions
151162
Untested: revert to the Cmpne primitive op: it can be used to test for NaN (x <> x ==> x is NaN)
f6ea37
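The parenthetical above relies on a standard IEEE 754 property: NaN is the only float that compares unequal to itself. A plain-OCaml illustration of just that property (not of OCANNL's Cmpne lowering):

```ocaml
(* IEEE 754: NaN <> NaN, so a not-equal primitive doubles as a NaN test. *)
let is_nan (x : float) = x <> x

let () =
  assert (is_nan Float.nan);
  assert (not (is_nan 1.0))
```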
Fix bug in grad formula for recip, update tests
8b6a6f
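For reference, the calculus identity behind the recip gradient (just the standard rule, not a statement of what the bug was): with $y = 1/x$,

```latex
\frac{\partial y}{\partial x} = -\frac{1}{x^{2}} = -y^{2},
\qquad\text{so in backprop}\qquad
\frac{\partial L}{\partial x} = -\frac{\partial L}{\partial y}\, y^{2}.
```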
Expose and test the new unary ops
d63fdf
Support non-diff "self" tensor inside backprop code
f874a4
Add new configuration options for diffing runs debug settings - Introduced `diff_max_distance_factor` and `debug_entry_id_pairs` in `ocannl_config.example` for enhanced diffing capabilities. - Updated `get_debug` function in `utils.ml` to utilize the new configuration options. - Modified `primitive_ops.ml` to include debug runtime settings. Signed-off-by: lukstafi <lukstafi@users.noreply.github.com>
451f36
fPIC for cc: but only openSUSE complained https://ocaml.ci.dev/github/ahrefs/ocannl/commit/ccaf459c55f1e1dab014a65af54e1ba2ec3b9ad0/variant/opensuse-15.6-5.3_opam-2.3 Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1a4d0e
Update GitHub actions Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
4ee46a
Missing from previous commit: test update Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
ccaf45
Move closing row variables to stage 3, because stage 2 may need to process inequalities generating more LUBs Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
27e1de
Load log_level earlier to consistently log all settings
367be2
Refactor handling defaults for global settings (empty string -> None); fix prev run specification for ppx_minidebug
8428de
handle previous run file generation correctly for multi-domain runs - Update `ocannl_config.example` to use `prev_run_prefix` instead of `prev_run_file` - Modify `get_debug` function to handle previous run file generation more flexibly - Improve file stem and previous run file path generation in debug configuration
9a3bdf
Configure between-runs diff highlighting Add optional parameters for previous run file and diff ignore pattern in debug configuration
783ad9
Tiny comment fix
ae45d6
Set BPlot.concise_float
d152b4
Try a different approach to `concise_float` normalization
521bf2
Temporary: Experiment 6
5e248a
Temporary Experiment 5
718613
Temporary: Experiment 4
f240f4
Temporary: Experiment 3
c9543f
Temporary: Experiment 2
52d3dd
Try rounding numerically to avoid discrepancies in float rendering
3053d3
Temporary Experiment
cd49f7
Better numbers text rendering, remove e+0
1e06f3
Update numbers text rendering
e69743
Improve configuration file parsing and CC backend compiler detection - Add support for empty config values, which now use default settings - Modify config file parsing to handle duplicate keys and strip prefixes - Update CC backend to dynamically detect compiler from OCaml configuration - Update example config file with clarified comments about configuration behavior
6e022a
Don't debug at log level 1
c94954
Toward supporting native Windows - Removed ocannl_npy package and related dependencies - Commented out Numpy-related code in ndarray.ml and train.ml - Updated dune files and opam files to remove npy dependencies - Updated CI workflow to remove ocannl_npy from depext - Switched pthread flag from `-pthread` to `-lpthread`
fc2755
Broken: Work in progress: operations for primitive unary ops
76540d
Recompile code if OCANNL_LOG_LEVEL env var changes; a todo
ac8e66
Operation corresponding to the primitive `fma`
12e1d1
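What the primitive computes, illustrated with OCaml's own `Float.fma` (how OCANNL maps the op onto each backend is a separate matter): a fused multiply-add evaluates x * y + z with a single rounding step.

```ocaml
(* Float.fma x y z computes x *. y +. z with one rounding instead of two;
   for many inputs the results coincide, but not always. *)
let () =
  let fused = Float.fma 2.0 3.0 1.0 in
  let naive = (2.0 *. 3.0) +. 1.0 in
  Printf.printf "fused = %g, naive = %g\n" fused naive
```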
Untested: missing new primitive ops for optional backends CUDA, GCC
b4fa5c
Tensor operation `sat01` i.e. `max(0, min(1, x))` and a primitive binop sat01_gate Also, fixed a "bug" in relu backprop that was accidentally masked by using strict inequalities in relu.
0020a2
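The scalar semantics stated above, spelled out in plain OCaml for clarity (the tensor operation itself is implemented per backend):

```ocaml
(* sat01 clamps a value into [0, 1]: max(0, min(1, x)). *)
let sat01 (x : float) = Float.max 0.0 (Float.min 1.0 x)

let () =
  assert (sat01 (-0.5) = 0.0);
  assert (sat01 0.25 = 0.25);
  assert (sat01 3.0 = 1.0)
```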
Get rid of the `patdiff` dependency Unfortunately it recursively depends on `core_unix`, which is not super portable. So it's nice that we don't need it anymore.
53b55f
Tiny update to `syntax_extensions.md`
af7952
Untested: primitive ops: change `Cmpne` to `Cmpeq` and add `Not`
f27985
Added missing curried or uncurried syntax variants; syntax_extensions.md update
a7c205
Untested: uncurried syntax for binary primitive ops (%cd) and derived operations (%op)
b1e31f
Refactors %cd parsing of primitive ops (uniform, hashtable lookups)
bcd89d
Untested: infrastructure for ternary operations
9682b4
Migrate the syntax away from operators for unary primitive ops and relu
842daa
Updated parsing of unary ops (%cd syntax)
e126bd
Progress toward incorporating new ops Comments, fma intro in simplification but no elim yet, introduce binary ops in cd_ppx.ml op lists. TODO: remove unary prefix ops from parsing, handle unary and ternary op applications in parsing.
ffbea7
Ternary primitive operations, in progress
fde798
No randomness primitive for now We should have a randomness operation instead.
2bd433
Remove Exp10, Log10; fill out unop_c_syntax
5efbf2
Fixes #303: major expansion of available operations, work in progress
5a04c7
Give up on squeezing Apple Metal before 0.6 as a separate release 0.5.2 I'm currently blocked on figuring out my Apple account access. OCaml's Metal bindings will come first anyway in parallel with other OCANNL work.
4ca8b1
Upgrade to printbox 0.12 and migrate plotting to printbox-ext-plot
53951b
2 tiny fixes: typo and `@fmt`
d4c964
Prepare for release 0.5.1
d8f264
Update `anatomy_of_a_backend.md` describing host data transfer automation
9ba762
Fix auto transfer from/to host in presence of multiple devices
7d333c
Fix auto transfer for constants Note: auto transfers currently don't handle multi-device, will need fixing.
a91751
Turn off automatic host transfers on demand
a58eab
Automated from_host transfers
6d41b7
Automated to_host transfers
1a3358
Fix ocannl_config parsing
f48985
In progress: automate host transfers
510452
Tiny polish for README Releases
e3fbe3
Final small tweaks to `anatomy_of_a_backend.md` before release v0.5
e76364
Fix github pages workflow
41a607
Prepare for release: Fix formatting, gh-pages script, bump version number
6180fd
`landmarks` was limiting cpu architecture
9e9e35
Bump `setup-ocaml` to latest version
d46f4c
Fix CI failures: unused type, missing Ubuntu package
5842e5
Remove `cuda_printf_fifo_size` from `Utils.settings`; add `never_capture_stdout` config
e97dcc
`low_level.ml`: clarify comment
263bc4
Update introductory slides; update the todo list Merging gradients slide updated hackily.
4c8239
Comment out a tiny cudajit.0.6.2 dependency
2ca3f2
Configuration documentation; removed dead configs
19dfd0
Debug logs highlights config
6589e0
The cuda backend is now a generative functor; `Cu.init` called at module initialization
0b7ae7
Fixes #295: always create new modules for `fresh_backend` to never leak any caches
2af41b
Memoize `size_in_bytes` inside `Tnode.t`
77b339
Revert the #295 related changes; more debugging
186a2d
Temporary(?) removal of ctypes array allocation
789d95
Aggressive debugging, especially of GC
586c9d
Fixes the memory model: on-host arrays can be in contexts Double check: it is not possible for this to trigger freeing a host array. Still broken: cc backend tests hang.
d4277b
Get rid of hard-coded pointers: all materialized nodes are kernel parameters
25c71e
In progress: get rid of hard-coded pointers, and of `opt_ctx_arrays`
606f3d
Get rid of `Postponed` In the future, device-config-specific compilation will be handled by laziness and caching.
b5d610
Bump `ocamlformat` version, reformat
93b427
Debugging tweaks
e85a09
Streamline missing backend mock maintenance
1ab276
Synchronize all devices of a stream, with cleanup; landmarks
772bea
Automatically synchronize potential overwriting of an array that is streamed into a merge buffer
c2c4bd
(1) Get rid of the option to share merge buffers, (2) refactor tracking merge buffer events -- formerly `~into_merge_buffer:Streaming` would not generate an event, but it should, to prevent overwriting the source. (2) will be continued: prohibiting overwriting till the routine using the streamed merge buffer finishes.
d16b54
Add backend debugging and missing synchronization between epochs
b9987f
Fix: upcast constants that exceed fp16 cutoff config
0159bf
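For context, the largest finite IEEE 754 half-precision value is 65504, so constants above it overflow to infinity in FP16. A hedged sketch of the kind of check involved; `needs_upcast_for_fp16` is a hypothetical helper, and the actual cutoff comes from OCANNL's configuration:

```ocaml
(* Largest finite IEEE 754 half-precision (FP16) value. *)
let fp16_max = 65504.0

(* Hypothetical helper: a literal whose magnitude exceeds the cutoff cannot be
   stored as FP16 and must be kept in a wider precision. *)
let needs_upcast_for_fp16 (c : float) = Float.abs c > fp16_max

let () =
  assert (needs_upcast_for_fp16 1e5);
  assert (not (needs_upcast_for_fp16 1000.0))
```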
Rename one of the virtual mode provenances 3331 -> 224
e698ef
a TODO file
9dd686
Fix sexp_of_device/stream to break cyclicity
f571d9
Untested: cross-stream CPU events The test suite hangs. I have the impression this was happening already before this commit.
72bf7e
Untested: synchronization for routines
aee62b
Proper syncing for `from_host`, `to_host` and `device_to_device`
164e9e
More refined events for tensor node syncing
b11683
Get rid of special-casing event creation; proper syncing for `from_host` and `to_host`
904115
In progress: the synchronization graph
fb04bc
Forbid changing `Shared_cross_stream` -> `Per_stream`
2459d4
A small thing
a31860
A small update to `anatomy_of_a_backend.md`
30d7be
`bin/moons_demo_parallel.ml` better settings
30509e
Verifying merge nodes should happen at schedule time
9ca154
Remove redundant unused type definition
11102d
Redundant `Backend.get_name` redefinitions
d36ba8
In progress: make `Tnode.is_in_context` non-forcing and more precise, pass host ptrs by params if undecided at compile time This is still not correct/ideal. Moreover it uncovers another potential bug about merge buffers.
7019e1
Rename `unified_memory` -> `use_host_memory` and add more debugging
069322
Complete factoring out `alloc_if_needed`
4f48ac
In progress step 2 of 3: factor out `alloc_if_needed` Steps 1 and 2: in `compile` parameter `opt_ctx_arrays` and `link` parameter `ctx_arrays`, expect arrays of the resulting context. Step 3: compute the context arrays before calling `Backend.link` when raising a backend.
041bc7
In progress: factor out `alloc_if_needed`
0f6fea
Fixes #285: update the backends documentation
c42347
README: update versioning
b82c6e
Rename `writing_a_backend` -> `anatomy_of_a_backend`
194aa2
Split `schedulers.ml` out of `backends.ml`
e34f94
In progress / broken: huge overhaul of backend internal interfaces and API All code that could reasonably be shared is shared now. It will make it easier to generically apply things like buffer-to-buffer synchronization. Bumped to cudajit 0.6.0.
031fc2
A small cleanup: unpack the backend modules right away
b3c892
Bug fix: forgot to keep run_module's lifetime
c0ff2e
In progress: factor out the `context` record type, remove redundant accessors Broken: CUDA_ERROR_INVALID_HANDLE for moons_demo_parallel_run, but moons_demo_parallel succeeds without crashing.
042e9c
Split `backend_types.ml` into `backend_intf.ml` and `backend_impl.ml`
d9c6d8
Backends: Remove now-redundant accessor functions, share `get_name`
6605df
Factor out a shared `device` record, include `stream_state` in `runner`
03c798
Formatting update
864cc2
Go back to using ints to identify streams
280662
A big refactoring of `backend_types.ml`; CPU allocation and copying not using bigarrays I'm unsure about this "busy-work" refactoring, but it does give some insight into the structure of backends.
177509
Backends: expose `Types.stream` from the signatures, implement `work_for`
e0515e
Backends: Factor out the shared stream fields
7d020c
Populate the inputs, outputs of a routine
8ccd03
Get rid of `type ctx_array`, uniform type `ctx_arrays`, factor out `work_for` into `[With|Add]_buffer_retrieval_and_syncing` The events will be stored in a hashtbl instead of with the pointers.
57bc6b
Backends: Unused per-context `used_memory`
a156c2
Backends: Factor out buffer retrieval from copying, finalize design
50afe3
More fine-grained refactoring of backend APIs
141658
Rename `optimize_proc` -> `optimize`, `lower_proc` -> `lower`
3cb993
Factor out `Indexing.Pp_helpers`, more interface files
1f2a22
Fixes #259: break up dependency on `ppx_jane`
067169
Get rid of `core`
e8f8eb
backends: Get rid of `subordinal`
4214f8
Get rid of `unsafe_cleanup` This is just a "dumb delete", might need a followup but nothing obvious. E.g. in the gccjit backend, `root_ctx` persists "forever".
0a47d4
Refactor the multicore device from multi-device to multi-stream; tiny cleanups
651f63
Factor out `Backend_common` and `Lowered_backend_common`
a07119
Tiny update to the changelog
c71100
Fixes sharing update: `Hosted Nonconstant` -> `Hosted (Changed_on_devices ...)` if sharing specified
6e11ff
Fixes #245: report used memory Note: cuda backend migration to `Tnode.sharing` still broken.
e2780a
`get_used_memory` depends on the device
a09e2d
Untested: a quick approx. `get_used_memory` Progress toward #245.
195387
Fixes #286: use `Tnode.sharing` in the cuda backend
bd0dc9
In progress toward #286: type `Tnode.sharing`
be9a29
Remove now-unused fields of `procedure`
cdc719
Cleanup `expected_merge_node`(s) after no longer verifying in `device_to_device`
ef76c9
Remove verification of merge buffer nodes inside `device_to_device`
2858d2
Cleanup dead code in the gccjit backend
e86628
Rename `backend_utils` -> `c_syntax`, uniformly validate merge nodes
0f0336
Move `Tnode.task` -> `Task.t`
d54b5e
Rename `Simple_backend` -> `Lowered_no_device_backend`
4c4dd2
The great renaming: `device` -> `stream`, `physical device` -> `device`
502ec4
Refactor backend types into a separate file
58ecce
Backends: Hide verifying prior context from the API; also a first step to a `Device_backend` abstraction.
7060a2
`%cd`: Allow single-string block comments
acb454
Verifying context by tracking `embedded_nodes`; big change, needs some cleanup but mostly done.
a83484
syntax_extensions.md: fix headings and Table of Contents
dfc185
syntax_extensions.md: more about the `...|... => ...|...2` example
d503d4
Document the generalized einsum notation in `syntax_extensions.md`, a note that we always allow broadcasting.
87839b
Temporary/broken: tensor-centric tracking of input nodes Going to change it to assignments-centric.
2fd7e2
Rename `iter_outputs` -> `iter_embedded` to avoid confusion
7dd2d3
Rename non_embedded/embedded distinction to inputs/outputs, defensively fix (make more precise) handling of grad nodes when computing inputs/outputs.
4aef0c
Postponing Apple Metal to 0.6
5eab2e
cuda backend: Fix: unsafe_cleanup was working with a destroyed context / finalized device
387acd
Rename await_ev -> sync; cuda backend: event API functions (but not yet syncing on copy).
58b4a6
Specification of device-to-device synchronization via API and docu-comments
112d45
Migrate to cudajit 0.5
8ad805
README: Dimension labels -> v0.8
3f7b5b
More milestones: pushed program search to 0.9, added expressivity (esp. convnets) at 0.8
2b17f4
README: outdated note about 0.4 release, API link
e8f099
`metal` backend at 0.4.3
c05b23
README: "Upcoming milestones" and "Releases"
14877d
Link to slides from README/Usage
f1333c
Opam linter driven fix: missing maintainer email
7c16e2
Fix internal dependency propagation, try 2
3921cc
Fix internal dependency propagation
87bcf9
Experiment: github ci macos target still failing (but the ocaml-ci macos targets already pass without this change).
18b639
one more test/moons_demo_parallel possible output
1af450
Experiment: fix the lower_bounds failure on micrograd_demo by lower-bounding `base` and `core`
8da72a
Prepare for release 0.4.1
5409ca
Experiment: use sync_cc backend in test/micrograd_demo
2b58ad
Missing test update for previous commit; another target for moons_demo_parallel :-(
237156
Two last unintentionally randomized params in tests
4959ad
Yet more nondeterministic moons_demo_parallel targets :-(
167150
Reinitialize inside test/moons_demo_parallel
42d889
Tiny docucomment fix
36aef7
Fix: Always reinitialize global state at the beginning of `let%expect_test`
f5fb81
`new_arm64_target` for test/moons_demo_parallel.ml
f7ebb3
`cc` backend: missing include math.h for `fmax`, `fmaxf`
0dd456
Update GitHub workflows to OCaml 5.2.0
e168b5
Final touches on the Fun OCaml presentation
8f2249
The (maybe) final version of the Fun OCaml slides "Mysteries of NNs training unveiled in OCaml"
91e891
Delete docs/OCANNL-slides-September_15th_2024.pdf Let's name slides by content and not timestamp them, since they're big files so should not be redundant.
fb2781
Delete docs/OCANNL_July_2024.pdf New slides are strictly better.
08949f
Fun OCaml 2024 presentation
59df2a
Illustrate ppx_minidebug logging from the cuda backend
6f12ca
Made `Train.sgd_one` slightly more thrifty
53fec9
Cosmetic changes to `moons_demo.ml`
417733
Changed `%cd` syntax `~~` to allow detailed structuring. Rewrote `Train.grad_update` to use the `%cd` syntax.
8a26b9
Prepare for pre-release 0.4.1~beta2
72d12b
Yay, CUDA works in the data parallel setup: extended the memory model to multi-virtual-device Quality works, but performance doesn't look good.
3e0596
Fix: `cuda`: keep track of virtual device ids (aka subordinals)
9540fe
Restore the functionality of debug logging from the cuda backend
56f3e7
Prepare beta release of 0.4.1
5bc72f
The `Nn_blocks` placeholder hinting at intended design of model components
efb5c8
Don't delay retrieving default precisions for value and gradient nodes
150cef
Tiny docucomment fixes
61dfc0
Fixes #28: Yay, proper support for mixed precision! Lazy precision defaults and delayed precision setting
c5f472
Untested: mixed precision conversions in `C_syntax`; related cleanup
3ea5f5
Detect FP16 constant overflow
db206b
Fix: only destroy a CUDA stream from its initial context
90b93f
Support for CUDA half precision, with related generalization of the `C_syntax` backend builder
cfd9ad
Proper support for half precision, don't use `Ctypes.bigarray_start`
582a71
Fixes handling C compilation errors
766132
Fix: Arrrgh, `"%f"` has precision of 6 digits and loses significant digits for small numbers.
e2d1a2
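The printf pitfall in question, illustrated in OCaml: `%f` always prints six digits after the decimal point, so small magnitudes collapse to zero, while `%g` (or the exact hexadecimal `%h`) preserves the significant digits.

```ocaml
let () =
  Printf.printf "%f\n" 1.5e-7;  (* prints 0.000000 -- the digits are lost *)
  Printf.printf "%g\n" 1.5e-7;  (* prints 1.5e-07 *)
  Printf.printf "%h\n" 1.5e-7   (* exact hexadecimal float representation *)
```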
Fix `await`: wake up device to finish work if it's "ready"; remove spurious `if`s
45a555
The previously-mocked support for half precision (but missing Ctypes) Currently broken because of missing Ctypes coverage.
57d8c4
Remove the pipes-based backends
1f4a41
Use `epoll` instead of `select` to avoid the `FD_SETSIZE` limit But `pipes_cc` is super slow compared to `cc`, I might just get rid of it.
ba1c23
Don't non-atomically skip `broadcast`, it's a no-op if not needed. Together with using `saturn` this fixed the deadlocking / deadlooping issues. But we lost all speedup from parallelism :-(
309496
Go back to using `saturn` because we face a deadlock
b879ef
`moons_benchmark` modified settings sweep
f65495
`debug_log_from_routines` should only happen when `log_level > 1`
41e59a
Revert to using our own SPSC unlimited FIFO queue
ccf03e
Yay, fix the scheduler bug: the old `d.is_idle` check in `await` ignored the queue
418af7
Temporarily, use Saturn's SPSC FIFO queue in Multicore_backend It has a fixed size limitation that our queue doesn't have.
1ddc17
Do not log from tests (unless intentionally)
eb58da
`Backends`: tiny comment fixes
60b96d
Properly handle `log_level` in `restore_settings`, inform about changing the log level
c63ab2
Critical bug: logging of computation traces was not properly converted -- had `log_block` in a recursive position.
0ee01d
Defensively fix the `patdiff` dependency
655ae2
Update README Upcoming milestones
8d8a38
Update CHANGES in preparation for 0.4.0 release
e56961
Another attempt to fix MacOS: remove `RTLD_DEEPBIND` from `dlopen`
127e80
For more illustrative CI logs, add a moons_demo_parallel test run logging to stdout
e167b0
Revert the experiment (MacOS failure persists)
8ed918
Experiment: try fixing MacOS failure: remove gccjit
8b048f
Fixes #147: a function to restore `Utils.settings`
23cb61
`Utils.set_log_level` to also update `Debug_runtime.log_level`
8d25f9
Multiple possible test targets for moons_demo_parallel; uses patdiff for comparison
1eebc4
Experiment: observe more concise output for moons_demo_parallel
e650a3
Experiment: observe expectation targets for moons_demo_parallel
e657d4
Config for tests experiment failed, revert; tiny lint fixes; don't format the ppx test target
bc36a0
Temporary experiment: maybe the backend test setting is not read Maybe the slightly different outputs for moons_demo_parallel on some CI targets are because the test config file is not read properly (sync_cc).
07821a
Set the ocamlformat version; update formatting
9d4ad0
Comments in config files: prefixed by `~~`; use backend=sync_cc for tests in v0.4; tiny formatting fix Hopefully, moons_demo_parallel with sync_cc gives the same results across different machines.
a16343
Move `Train.fresh_backend` -> `Backends.fresh_backend`
8ed1bd
Prepare README for the 0.4 release
1fcf25
Regression in cuda_backend.missing.ml
82d563
Fixes #282: ppx_op: lift `~config` applications out of functions; also fixes matching of `~config` params (That was a biggish bug.)
5525fe
Fix ReLU op typo in the syntax extensions doc
736dc5
Be a bit more defensive about passing identifier labels, test it
efc577
Fixes #210: ppx_op: incorporate the input tensor's label in the resulting tensor's label
6d58c9
In progress: ppx_op pass the label for the primary tensor directly as a string list In preparation for including a parameter tensor's label in the primary tensor label.
930bd0
Fixes #279: `ident_label` in `ppx_cd` is not helpful
d1a286
Fixes #283: ppx_op: allow field deref for inline dim spec, allow type spec for the `~config` param
38e45b
Yay, the completed syntax extensions documentation!
5c04af
Fix outdated remark in shape_inference.md
daef2b
Syntax extensions documentation: lots of details about `%op`, many examples Two major features of `%op` described are still TODO
8db693
Syntax extensions documentation: The syntax for `%cd`
9b2af9
Syntax extensions documentation: organization, preliminaries, literals
933126
Untested: new `%op` functionality: refine the param's label when under `~config`
1ed327
ppx_op: remove redundant `fun` expression patterns, since `Pexp_fun` is single argument anyway
5d820f
Untested: prevent `%cd` inline declarations with escaping label sources
fd8f53
Fixes #191: remove code duplication in ppx_cd.ml
156518
ppx_cd overhaul: (1) only introduce tensor bindings for non-identifier expressions, (2) remove broken `array_of_code` and keep track of the needed array explicitly via `array_opt_of_code` (might consider removing this functionality altogether), (3) fix missing forward code incorporation for <tensor>.value syntax, (4) give syntax errors in a few more bad situations e.g. non-identifier expression in assigned-to position, (5) some small code organization and documentation gains.
4161df
`%cd`: Safeguard more uses of inline declared tensors in declaring other tensors -- `.value`, `.grad`, `.merge` cases.
4e9fda
Fixes #262: inline declarations of no-gradient tensors in the `%cd` syntax
20e5c0
In progress: toward punning in the `%cd` syntax i.e. inline no-gradient tensor definitions.
85d6d9
Refactor ppx_cd to prepare for punning, and both ppx_cd, ppx_op to share the expanders code (final step of punning).
acd987
Better syntactic `Assignments.is_noop`
e4ad37
Migrate to ppx_minidebug 2.0
d52f71
Finalize the transition to using local debug runtimes
e4b82a